Markov Decision Processes 4
4.8 SMC for General MDP
4.8.1 Using Learning Algorithms
Where nondeterministic choices in an MDP are not spurious, the maximum and minimum probabilities of reaching a state satisfying a certain state formula φ are not the same. However, schedulers exist that lead to precisely those probabilities. In exhaustive model checking using e.g. value or policy iteration, they are (implicitly or explicitly) found by a fixpoint operation over the whole state space.
Using Reinforcement Learning
Henriques et al. [HMZ+12] instead propose the use of reinforcement learning [SB98,CFHM07], an artificial intelligence technique, to iteratively improve an arbitrary candidate probabilistic scheduler so that its decisions come closer and closer to those of a maximising one. The algorithm works with probabilistic schedulers during the learning and improvement phase in order to explore more of the state space instead of completely ruling out certain transitions early on.
The improvement of schedulers is based on a measure of how “good” a transition is in reaching a target state. This is approximated by performing a number of simulation traces and observing which decisions actually do lead to target states. Subsequently, the scheduler is updated such that good transitions are chosen with a higher probability (but not with probability 1). After a number of these improvement steps, a nonprobabilistic scheduler is computed by selecting the most likely decisions of the probabilistic candidate in each state, and a standard SMC analysis is performed using this scheduler to resolve the nondeterministic choices. The entire procedure is summarised in Algorithm 21.
First of all, we observe that the procedure is only specified for the qualitative form of probabilistic reachability properties, P(φ) ≤ x (or equivalently with bound < x). This is because the scheduler that is being approximated is the one that maximises the reachability probability. When a scheduler S has been found that leads a standard SMC analysis (line 8), e.g. using the SPRT of Algorithm 8, to conclude that the property is false, we know (with the usual error bounds of the SMC algorithm used) that this scheduler is a counterexample to the bound x. However, the entire process does not give any guarantees about the optimality of that final scheduler. This means that we simply do not know how far the probability of reaching a φ-state in ind(M, S) is from the actual maximum probability. This is why it is convenient to primarily consider the qualitative form, and why the algorithm can only return probably true (a.k.a. unknown) when a sufficiently large number t of “learned” schedulers does not lead to a violation of the bound.
Input: MDP M, property P(φ) ≤ x, d, t, L
Output: ⟦P(φ) ≤ x⟧_M (false or probably true)
 1  for i = 1 to t do
 2      S := RUni
 3      for j = 1 to L do
 4          f := evaluate transitions in ind(M, S)
 5          S := update probabilities in S based on f
 6      end
 7      S := determinise S
 8      if SMC for M and S returns false then return false
 9  end
10  return probably true
Algorithm 21: SMC for general MDP with learning [HMZ+12]
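The structure of Algorithm 21 can be illustrated with a minimal Python sketch on a toy model. It is not the procedure of [HMZ+12]: the small MDP, the quality measure (the fraction of traces through a state-action pair that reach the target), the `greed` update rule and the fixed-sample check that stands in for the SPRT are all hypothetical simplifications chosen for brevity.

```python
import random

# Hypothetical toy MDP: state -> actions; an action is a list of
# (probability, successor) pairs. "goal" and "sink" are absorbing.
MDP = {
    "s0": [[(1.0, "s1")], [(0.3, "goal"), (0.7, "sink")]],
    "s1": [[(0.8, "goal"), (0.2, "sink")], [(0.1, "goal"), (0.9, "sink")]],
}
TARGET = {"goal"}

def simulate(sched, d, rng):
    """One trace of at most d steps; returns (target reached, visited pairs)."""
    state, visited = "s0", []
    for _ in range(d):
        if state not in MDP:  # absorbing state
            break
        weights = sched[state]
        a = rng.choices(range(len(weights)), weights=weights)[0]
        visited.append((state, a))
        probs, succs = zip(*MDP[state][a])
        state = rng.choices(succs, weights=probs)[0]
    return state in TARGET, visited

def evaluate(sched, d, runs, rng):
    """Line 4: estimate how often each visited pair lies on a successful trace."""
    good, total = {}, {}
    for _ in range(runs):
        ok, visited = simulate(sched, d, rng)
        for pair in set(visited):
            total[pair] = total.get(pair, 0) + 1
            good[pair] = good.get(pair, 0) + ok
    return {pair: good[pair] / total[pair] for pair in total}

def update(sched, quality, greed=0.5):
    """Line 5: shift weight towards the best action, never to probability 1."""
    new = {}
    for s, weights in sched.items():
        q = [quality.get((s, a), 0.0) for a in range(len(weights))]
        if not any(q):  # no information about this state: keep the weights
            new[s] = weights
            continue
        best = max(range(len(q)), key=q.__getitem__)
        new[s] = [(1 - greed) * w + (greed if a == best else 0.0)
                  for a, w in enumerate(weights)]
    return new

def determinise(sched):
    """Line 7: keep only the most likely action in every state."""
    det = {}
    for s, weights in sched.items():
        best = max(range(len(weights)), key=weights.__getitem__)
        det[s] = [1.0 if a == best else 0.0 for a in range(len(weights))]
    return det

def bound_violated(sched, x, d, rng, runs=1000):
    """Line 8 stand-in: crude fixed-sample check instead of the SPRT."""
    hits = sum(simulate(sched, d, rng)[0] for _ in range(runs))
    return hits / runs > x

def smc_with_learning(x, d, t, L, runs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(t):
        # Start from the uniform probabilistic scheduler.
        sched = {s: [1 / len(acts)] * len(acts) for s, acts in MDP.items()}
        for _ in range(L):
            sched = update(sched, evaluate(sched, d, runs, rng))
        if bound_violated(determinise(sched), x, d, rng):
            return "false"
    return "probably true"
```

In this toy model the maximum reachability probability is 0.8, so a call such as `smc_with_learning(x=0.5, d=5, t=5, L=10)` is expected to find a counterexample scheduler, whereas a bound of 0.95 cannot be violated by any scheduler.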
Pitfalls A significant problem that is not taken into account in [HMZ+12], pointed out by Legay and Sedwards [LST14], is that the SMC analysis, using a test like the SPRT that may sometimes give the wrong answer, is performed several times until it returns false (or we give up after t tries). This leads to error accumulation: While a single invocation of SMC in line 8 incorrectly returns false with a certain probability, the probability of the entire Algorithm 21 incorrectly returning false is higher. In fact, if the actual probability of reaching a φ-state is greater than zero, we are guaranteed to get the result false if we just iterate the outer loop of the algorithm often enough.
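The effect of error accumulation can be made concrete with a small calculation. As a simplifying assumption that is not part of the source, suppose the property is true and every invocation of SMC independently returns false with the same probability `alpha`; the name `accumulated_error` is hypothetical. The probability that at least one of t invocations errs is then 1 − (1 − alpha)^t:

```python
def accumulated_error(alpha: float, t: int) -> float:
    """Probability that at least one of t independent tests wrongly returns false,
    assuming each test errs with probability alpha (illustrative assumption)."""
    return 1.0 - (1.0 - alpha) ** t
```

Under these assumptions, a per-test error of 0.05 grows to more than 0.99 after 100 iterations of the outer loop, which illustrates why t cannot be increased freely.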
Other oversights include that this technique learns and improves memoryless schedulers, but it is specified for the verification of step-bounded properties. For these, the assumption of Definition 52 that memoryless schedulers are sufficient to maximise or minimise the reachability probability does not hold. An easy way to fix this problem would be to analyse unbounded properties instead, using cycle detection as outlined in Section 3.2.3 and used by all the simulate functions presented so far in this thesis.
Performance The memory usage of this approach clearly depends on how many scheduling decisions need to be stored by the computed schedulers. For each iteration of the outer loop of Algorithm 21, the amount of memory used is thus given by the number of nondeterministic states encountered during the simulation runs performed for scheduler evaluation in line 4. In the worst case, this is the number of states n of the MDP under study. However, as we have already seen for the POR- and confluence-based approaches in Section 4.7.3 and as Henriques et al. also show for other examples, this number can be much lower, but is in general highly dependent on the structure of the model. In terms of worst-case runtime, which occurs when the property is actually true (or t is large enough to incorrectly make the algorithm answer false), the parameters t and L are decisive as we need to perform up to t · L standard SMC analyses.
Learning Framework with BRTDP and DQL
In a different attempt to make use of learning techniques for the analysis of MDP models, Brázdil et al. [BCC+14] recently proposed a “general framework” to apply different learning algorithms to this problem. They define a learning algorithm as one that iteratively generates paths via simulation and updates upper and lower bounds U and L for a value function
V ∈ S × (A × Dist(S)) → [0,1]
over state-transition pairs defined as V(s, ⟨a, μ⟩) = ∑_{s′∈S} μ(s′) · V(s′). V maps each state s′ to its value, which is the maximum probability of reaching a φ-state from s′. The algorithm terminates when |U(s_init, tr) − L(s_init, tr)| < ε for all outgoing transitions tr of the initial state s_init. The precision ε is given as an argument to the algorithm.
While the idea superficially appears similar to the reinforcement learning approach presented before, and in particular focuses on maximum reachability probabilities ⟦P_max(φ)⟧ again, there are crucial differences. For one, there is no separation between a learning and an evaluation phase; instead, path generation and the improvement of the approximations occur iteratively in alternation, starting from safe bounds. This avoids the problem of repeated invocations causing error accumulation. Furthermore, both a lower and an upper bound are computed. Memory usage, however, remains similar in that the current approximations U and L need to be stored for every state-transition pair that is visited during the simulation runs. It is thus again highly dependent on the structure of the model. The same applies to runtime: The number of iterations performed is simply the number of iterations needed until the difference between U and L in the initial state is below ε. This, too, obviously depends on the model at hand.
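The alternation of path generation and bound improvement can be sketched in Python as follows. This is a deliberately simplified illustration and neither BRTDP nor DQL: actions are chosen uniformly at random, successors are sampled from the transition's distribution, and the hypothetical toy MDP is acyclic, so the end component handling of [BCC+14] is not needed. All names are assumptions made for this sketch.

```python
import random

# Hypothetical acyclic toy MDP: state -> actions; an action is a list of
# (probability, successor) pairs. "goal" and "sink" are absorbing.
MDP = {
    "s0": [[(1.0, "s1")], [(0.3, "goal"), (0.7, "sink")]],
    "s1": [[(0.8, "goal"), (0.2, "sink")], [(0.1, "goal"), (0.9, "sink")]],
}
TARGET, S_INIT = {"goal"}, "s0"

def state_bound(bounds, s, default):
    """Bound for state s: the maximum over its outgoing transitions."""
    if s in TARGET:
        return 1.0
    if s not in MDP:  # non-target absorbing state
        return 0.0
    return max(bounds.get((s, a), default) for a in range(len(MDP[s])))

def learn_bounds(eps, seed=0, max_paths=10_000):
    rng = random.Random(seed)
    U, L = {}, {}  # only visited state-transition pairs are stored
    init_pairs = [(S_INIT, a) for a in range(len(MDP[S_INIT]))]
    for _ in range(max_paths):
        # Terminate once every transition of the initial state is eps-tight.
        if all(U.get(p, 1.0) - L.get(p, 0.0) < eps for p in init_pairs):
            break
        # Generate one path by simulation.
        path, s = [], S_INIT
        while s in MDP:
            a = rng.randrange(len(MDP[s]))
            path.append((s, a))
            probs, succs = zip(*MDP[s][a])
            s = rng.choices(succs, weights=probs)[0]
        # Improve both bounds along the path, starting from its end.
        for s, a in reversed(path):
            U[(s, a)] = sum(p * state_bound(U, t, 1.0) for p, t in MDP[s][a])
            L[(s, a)] = sum(p * state_bound(L, t, 0.0) for p, t in MDP[s][a])
    l = max(L.get(p, 0.0) for p in init_pairs)
    u = max(U.get(p, 1.0) for p in init_pairs)
    return l, u
```

Because the bounds start at the safe values 0 and 1 and are only ever replaced by values computed from other safe bounds, the returned pair encloses the maximum reachability probability of the toy model (0.8) at every point of the iteration, not only on termination.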
The framework is instantiated in [BCC+14] using two concrete learning algorithms, bounded real-time dynamic programming (BRTDP) and delayed Q-learning (DQL), that require different a priori information about the analysed model’s state space and that deliver different confidence statements. We omit the details of how these algorithms work, which in particular involves some technical complexity to correctly handle MDP with multiple end components, and just summarise their requirements and the resulting error bounds:
Complete and limited information The BRTDP algorithm requires what the authors of [BCC+14] call “complete information” and which is mostly equivalent to the assumptions we make in this thesis: that the transition function may be given as a program, but can be evaluated for any state at will, thereby giving access to the complete information about a state’s outgoing transitions and their target probability distributions. In addition, however, algorithms working in the complete information setting are also assumed to have access to global information about the MDP at hand, such as the number of states or the maximum number of outgoing transitions over all states. DQL, on the other hand, can work with “limited information”, which means that evaluating the transition function still gives the set of outgoing transitions, but the target probability distributions are opaque and can only be sampled. In addition, upper bounds on the numbers of states and outgoing transitions are assumed to be known, as is a lower bound > 0 for the smallest probability of any branch in the entire MDP. In summary, this means that both learning algorithms need slightly more information than what we assume to be available for SMC. Still, global bounds like those required for DQL could be computed from e.g. the MODEST code of an MDP model.
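The difference between the two settings can be pictured as two interfaces to the same model. The following Python sketch is purely illustrative; the class and method names are hypothetical and do not correspond to any API in [BCC+14] or in the tools discussed in this thesis.

```python
import random

class CompleteInfoMDP:
    """Complete information: all outgoing transitions and their distributions
    can be inspected, and global information such as the state count is known."""

    def __init__(self, transitions):
        # transitions: state -> {action: {successor: probability}}
        self._transitions = transitions

    def transitions(self, state):
        return self._transitions.get(state, {})

    def num_states(self):
        states = set(self._transitions)
        for actions in self._transitions.values():
            for dist in actions.values():
                states.update(dist)
        return len(states)

class LimitedInfoMDP:
    """Limited information: the set of outgoing transitions is visible, but
    distributions are opaque and can only be sampled. Global bounds are given."""

    def __init__(self, model, min_probability, rng):
        self._model, self._rng = model, rng
        self.max_states = model.num_states()    # assumed known upper bound
        self.min_probability = min_probability  # assumed known lower bound > 0

    def actions(self, state):
        return list(self._model.transitions(state))

    def sample(self, state, action):
        dist = self._model.transitions(state)[action]
        succs = list(dist)
        return self._rng.choices(succs, weights=[dist[s] for s in succs])[0]
```

An algorithm written against `LimitedInfoMDP` can only call `actions` and `sample`, so it has to estimate probabilities from repeated samples, which is where the lower bound on the smallest branch probability becomes relevant.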
Error bounds As mentioned, the proposed learning framework does not explicitly compute the number of iterations (and thus of simulation runs) necessary to achieve the desired precision, but instead iterates as long as necessary. On termination, we get two values u = max_{tr∈T(s_init)} U(s_init, tr) and l = max_{tr∈T(s_init)} L(s_init, tr) that are possibly an upper and a lower bound, respectively, for ⟦P_max(φ)⟧, with |u − l| < ε. For BRTDP, we have the guarantee that the algorithm almost surely terminates with u and l indeed being correct bounds. For DQL, the corresponding statement is similar to what the APMC method gives us (cf. Section 3.2.3): The algorithm takes an extra parameter δ, and with probability at least 1 − δ, terminates with u and l being correct bounds.
The bits of global information about the MDP at hand that we mentioned above are required at various places inside the two algorithms to achieve correctness and convergence; for example, the state space size is necessary to compute a valid “delay” for use inside DQL.
Exhaustive or statistical? In summary, the two learning algorithms presented for this “framework” make for sound learning-based approaches to the analysis of general MDP. However, their memory footprint and especially the fact that they require some global information about the MDP at hand (which also means that they cannot be applied to infinite-state models) mean that they depart from the spirit of SMC in some way. It may appear more natural to view them as techniques that significantly improve exhaustive model checking by reducing the number of states that have to be explored. In fact, the experimental section of [BCC+14] focuses on the BRTDP technique in comparison to the exhaustive model checking implementation of PRISM, where indeed significant speedups can be observed.
Input: MDP M, state formula φ, number of schedulers a, d, ε, δ
Output: ⟨p̂min, p̂max⟩ (two real numbers)
 1  N := ⌈−ln(1 − (1 − δ)^(1/M)) / (2ε²)⌉
 2  p̂min := 1, p̂max := 0
 3  for i = 1 to M do
 4      S_i := randomly selected scheduler for M
 5      truecount := 0
 6      for j = 1 to N do
 7          res := simulate(ind(M, S_i), d)
 8          if res = unknown then return unknown
 9          else if res = true then truecount := truecount + 1
10      end
11      p̂_i := truecount / N
12      if p̂max < p̂_i then p̂max := p̂_i
13      if p̂min > p̂_i then p̂min := p̂_i
14  end
15  return ⟨p̂min, p̂max⟩
Algorithm 22: SMC for MDP with scheduler sampling (based on [LST14])
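Line 1 of Algorithm 22 fixes the number N of simulation runs performed per sampled scheduler. A minimal Python sketch of that computation follows, assuming the line reads N = ⌈−ln(1 − (1 − δ)^(1/M)) / (2ε²)⌉ with M the number of sampled schedulers; the function and parameter names are hypothetical.

```python
import math

def runs_per_scheduler(num_schedulers: int, eps: float, delta: float) -> int:
    """Number of simulation runs per scheduler as in line 1 of Algorithm 22,
    under the reading of the formula stated in the text above (an assumption)."""
    # Error budget left for each individual scheduler so that the
    # overall confidence over all sampled schedulers is 1 - delta.
    per_scheduler = 1.0 - (1.0 - delta) ** (1.0 / num_schedulers)
    return math.ceil(-math.log(per_scheduler) / (2.0 * eps ** 2))
```

The per-scheduler error budget shrinks as more schedulers are sampled, so N grows with the number of schedulers, although only logarithmically.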