In the case of the discounted model, realizable solutions in the form of switching and randomized strategies were constructed in [8, 9]. But if the discount factor α is zero, standard randomized strategies are not sufficient for solving optimal control problems, as demonstrated in Section 4.
In Section 2, we describe the model and a wide class of control strategies, including the discussed cases 1-4 and their combinations, along with mixtures. In Section 3 we prove the main results about the realizability of strategies. Definition 3 seems a natural way to introduce the concept of realizability. Theorem 2 states that a strategy is realizable if and only if there is a random process equivalently representing it. Thus, the existence of a process satisfying all the assertions of Definition 4 can be accepted as another natural definition of realizability. Finally, for constrained models with the total expected cost, we present in Section 4 a sufficient class of realizable strategies, namely Poisson-related strategies, and show in Section 5 how the tools developed for discrete-time models can be used for solving continuous-time problems. Note that we investigate the undiscounted Borel model with arbitrarily unbounded transition and cost rates, with the possibility of explosion, and with an arbitrary, not necessarily finite, number of constraints. All this distinguishes the current article from similar works in the area.
Example 5.4. (A cash flow model.) Consider a continuous-time control problem for cash flow in an economic market, in which the amount of cash is referred to as the state of the cash flow. Thus, the state space of the cash flow is S = (−∞, +∞). Given the current state x ∈ S, a control action a ∈ A(x) is performed by withdrawing the amount −a if a < 0, or by taking a supply of money in the amount a if a ≥ 0. When the current state is x ∈ S and an action a ∈ A(x) is chosen, a reward r(x, a) is earned. In addition, the amount of cash x is assumed to remain unchanged for an exponentially distributed random time with parameter k(x, a) ≥ 0, after which the cash flow jumps to another state according to the normal distribution N(x, 1). Therefore, the transition rates of the cash flow are represented by
The mathematical model of Markov decision processes (MDPs) is suitable for modeling decision making in situations where the evolution of a system is partly probabilistic and partly controlled by strategic choices. To define the notion of an optimal strategy, we need to associate costs (or, equivalently, rewards) with an execution of an MDP. Traditionally, this is done by associating a numerical value with each action in each state, and the cost of a terminating execution is the sum of the costs of all the decisions made along the way. The optimization problem then is to minimize the expected cost (or, equivalently, maximize the expected reward) over the set of all strategies. It is well known that there exists a memoryless optimal strategy, wherein the globally optimal decision at any step during an execution is a function of the current state, and the optimization problem can be solved in polynomial time using linear programming.
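The expected total cost criterion and the memoryless strategy it induces can be illustrated on a toy instance. The following is a minimal sketch, not the linear programming method mentioned above: it uses plain value iteration instead, and the MDP itself (state names, actions, costs, probabilities) is entirely hypothetical.

```python
# Hypothetical toy MDP: MDP[state][action] = (cost, {next_state: probability}).
# "goal" is absorbing and cost-free, so every execution terminates there.
MDP = {
    "s0": {
        "safe": (2.0, {"s1": 1.0}),
        "risky": (1.0, {"goal": 0.5, "s0": 0.5}),
    },
    "s1": {
        "go": (1.0, {"goal": 1.0}),
    },
}


def q_value(values, cost, successors):
    """Immediate cost plus expected cost-to-go of the successor states."""
    return cost + sum(p * values[t] for t, p in successors.items())


def value_iteration(mdp, goal="goal", tol=1e-10, max_iters=10_000):
    """Return (values, policy) minimising the expected total cost to reach goal."""
    values = {s: 0.0 for s in mdp}
    values[goal] = 0.0
    for _ in range(max_iters):
        delta = 0.0
        for s, actions in mdp.items():
            best = min(q_value(values, c, succ) for c, succ in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # The resulting strategy is memoryless: one action per state.
    policy = {
        s: min(actions, key=lambda a: q_value(values, *actions[a]))
        for s, actions in mdp.items()
    }
    return values, policy


values, policy = value_iteration(MDP)
```

In this instance the chosen action in each state depends only on that state, which is the memoryless property described above.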
The ANTG case study models a complex museum with a variety of collections. Due to the popularity of the museum, there are many visitors at the same time. Different visitors may have different preferences in art. We assume the museum divides all collections into categories that are housed in different rooms, and visitors can choose what they would like to visit and pay for tickets according to their preferences. In order to obtain the best experience, a visitor can, prior to her visit, assign a predefined weight to each category denoting her preferences, and then design the best strategy for the visit. The problem with this approach is that the preference weights depend on many time-dependent factors, such as price, weather, or the length of the queue at that moment, and are hard to compute in advance. To account for this, we allow uncertainty in the preferences, such that their values may lie in an interval, and ask for the best strategies for a given museum. The solution, in the form of the best policy or policies, can then be used by the museum’s administration for fare design decisions or load analysis; for a visitor, the policies can serve as decision support for an optimal museum experience.
We now describe a polynomial-time reduction from quantitative parity to quantitative reachability that preserves value vectors. The idea is to allow the strategy to irreversibly switch to an optimal strategy for environment i from any MEC of M_i, and to represent this switch by an absorbing target state. Intuitively, the new reachability condition is equivalent to the parity objective for two reasons: first, all runs eventually enter an end-component and stay there, which roughly corresponds to this switch; and second, the transformation only adds new actions, so any strategy in the original MEMDP, and in particular any learning strategy, is still valid in the new one. It follows that 1) there is a polynomial-time algorithm for the limit-sure parity problem, and 2) any algorithm for quantitative reachability can be used to solve the quantitative parity problem. In particular, the results of Section 7 apply to parity.
schema, that is, each instance is repeated 30 times (objectsearch excluded), the results are averaged, and the 95% confidence interval is computed. However, for every instance we replan from scratch for a fair comparison with SST. In addition, time and number of samples refer to the plan execution of one instance. The results (Table 1) highlight that our planner generally obtains better results than SST, especially at higher horizons. HYPE obtains good results in discrete domains but does not reach state-of-the-art results (score 1) for two main reasons. The first is the lack of a heuristic, which can dramatically improve performance; indeed, heuristics are an important component of PROST, the IPPC winning planner. The second reason is time performance, which allows us to sample only a limited number of episodes and would not allow all the IPPC 2011 domains to be finished in 24 hours. This is caused by the expensive Q-function evaluation; however, we are confident that heuristics and other improvements will significantly improve performance and results.
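The averaging step over 30 repetitions can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the 95% confidence interval was computed, so a Student-t interval is assumed here, and the score values and the names `mean_with_ci95` and `T_CRIT_29` are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 29 degrees of freedom (n = 30 runs).
T_CRIT_29 = 2.045


def mean_with_ci95(scores, t_crit=T_CRIT_29):
    """Return (mean, half_width) of an assumed t-based 95% confidence interval."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Standard error of the mean, using the sample standard deviation.
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width


# Hypothetical scores from 30 repetitions of one instance.
scores = [0.5 + 0.01 * (i % 5) for i in range(30)]
mean, half_width = mean_with_ci95(scores)
print(f"score = {mean:.3f} +/- {half_width:.3f}")
```

The default critical value is only valid for exactly 30 runs; a different number of repetitions needs the matching t quantile.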
From the above data, we can conclude that using variable cheap iterations performs better in almost all cases. For the fixed cheap iteration approach, 10 cheap iterations (C10) is almost always best. For the variable cheap iteration styles (1, 2, and 3), parameter sets S1-a, S2-b, and S3-b are generally the best. Figure 6.1 shows the trend of the CPU time requirement for the different solution methods as the number of states grows. In general, the CPU time grows geometrically with state space size, as many others have noted. The slight plateau between 10,000 and 12,696 states may arise because the structure of the problem changes from two stages (10,000 states and fewer) to three inventory stages (12,696 states and larger). From the figure we can see that for 10,000 states or fewer, S1-a and S2-b have almost the same CPU requirement, while that of S3-b is a little higher but still less than that of C10. When there are more than 10,000 states, S3-b becomes better than S2-b and very close to S1-a. In all cases, the best variable cheap iteration method within each style performs better than fixed cheap iterations.
Embodied agents like robots are used in increasingly complex, real-world domains, such as domestic environments (Iocchi et al., 2012), health care (Okamura et al., 2010), and extraterrestrial settings (Grotzinger et al., 2012). These environments are often unstructured, populated by humans, and changing over time. At the same time, robots are becoming increasingly sophisticated, both in terms of their hardware and their control software; see, e.g., Lemburg et al. (2011) and Bartsch et al. (2012). A simple, reactive control approach (Brooks, 1986) is not sufficient for these systems, as it lacks the ability to predict and control the environment on larger scales of time and space. For this, agents must be able to build up both procedural and declarative knowledge about the world and store this knowledge in a convenient way so that it can be reused and adapted easily. This requires robotic control architectures that allow learning, utilization, combination, integration, and adaptation of procedural and declarative knowledge. A multitude of robot control architectures has been proposed over the last years (see Murphy (2000) for a discussion and an overview). Figure 1.1 presents one example of a 3-layered, “hybrid” control architecture.
Two subjects with large amounts of missing data were excluded. All other missing values are imputed with the fitted value from a local polynomial regression of the state variable on time t. We treat all the variables as ordinal, partitioning some of them (see Appendix B.2 for a complete description). We estimate φ̂_ADNN(s_t), wherein conditional independence is checked via condition (i) in Lemma 3.4 using a likelihood ratio test (LRT). (Recall that we use the distance covariance test for continuous states and the LRT for categorical or ordinal states. See Appendix B.3 for more details on the LRT.) The dimension of φ̂_ADNN(s_t) is set to be the smallest dimension for which φ̂_ADNN(s_t) fails to reject this independence condition at level τ = 0.05; this procedure resulted in a feature set of dimension six. To increase the interpretability of the constructed feature map, we constrained the dimension reduction network to have no hidden layers. Under this constraint, φ̂_ADNN(s_t) is a linear transformation of s_t followed by application of Φ^◦, which was set to be the arctangent function. A plot of the
First of all, the random storage policy prescribes choosing an empty location in the warehouse at random for the incoming products or pallets. The random assignment method results in high space utilization at the expense of increased travel distance (Sharp et al., 1991). Secondly, under the closest open location storage policy, one always uses the first open empty location to store the product or pallet. Finally, in the class-based storage policy all products are divided into classes based on a measure of product demand, such as the cube per order index (COI) proposed by Heskett (1963) and Heskett (1964). Specific areas of the warehouse are dedicated to each of these classes. Storage within each area is done randomly. Most of the time the number of classes is restricted to three, although more classes might reduce travel times in some cases (de Koster et al., 2007). Special cases of class-based storage are dedicated storage and random storage. In the case of dedicated storage, each class consists of one kind of product, while in the case of random storage there is only one class that contains all products. Several studies have tried to optimise routing and picking using Markov decision processes. Bukchin et al. (2012) used an MDP to optimise the trade-off between going on a picking tour and waiting until more orders have arrived. Furthermore, Hodgson et al. (1985) applied a semi-Markov decision process to develop general control rules for routing an automated guided vehicle. However, even though many studies have focused on optimising storage processes, MDPs have not yet been considered in this regard. Hausman et al. (1976) determined optimal boundaries for a class-based storage assignment with two and with three classes, considering the racks of the warehouse to be square-in-time, meaning that the horizontal travel time equals the vertical travel time.
This method has been extended by Eynan and Rosenblatt (1994) to any rectangular rack and by Rosenblatt and Roll (1988) to any number of classes.
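The class-based storage policy described above can be sketched in a few lines. This is a hedged illustration, not the boundary optimisation of the cited studies: the COI is taken as required storage space divided by order frequency, the split into classes A, B, and C uses hypothetical fixed shares, and all product data and function names are made up for the example.

```python
def cube_per_order_index(required_space, order_frequency):
    """COI: storage space a product needs per order it receives."""
    return required_space / order_frequency


def assign_classes(products, shares=(0.2, 0.3, 0.5), labels=("A", "B", "C")):
    """Rank products by ascending COI and cut the ranking into classes.

    products: {name: (required_space, order_frequency)}
    shares:   hypothetical fraction of products per class (an assumption,
              not an optimal boundary).
    """
    ranked = sorted(products, key=lambda name: cube_per_order_index(*products[name]))
    assignment = {}
    start = 0
    cumulative = 0.0
    for label, share in zip(labels, shares):
        cumulative += share
        end = round(cumulative * len(ranked))
        for name in ranked[start:end]:
            assignment[name] = label
        start = end
    # Any products left over by rounding go to the last class.
    for name in ranked[start:]:
        assignment[name] = labels[-1]
    return assignment


# Hypothetical catalogue: equal space, decreasing order frequency.
products = {f"p{i}": (1.0, float(10 - i)) for i in range(10)}
classes = assign_classes(products)
```

Products with the lowest COI land in class A, which would be given the area closest to the input/output point; within each area, locations would then be chosen at random.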
In this chapter, we consider a risk-sensitive continuous-time Markov decision process (CTMDP) over a finite time duration. Given the results of Chapter 5 on the PDMDP, it is natural to ask whether the finite-horizon CTMDP problem with nonnegative cost rates can be extended to the unbounded case. Chapter 6, in turn, treats the discounted CTMDP problem with a lower bounding function, where the technique is a transformation from the general case to the case of nonnegative cost rates. If a similar transformation could be used here, the problem would reduce to an application of the risk-sensitive PDMDP results. Unfortunately, we do not yet know how to combine these two approaches to handle the finite-horizon risk-sensitive CTMDP with unbounded cost rates, so we instead look for a modified Feynman-Kac formula to obtain the results. In the following, under conditions that can be satisfied by unbounded transition and cost rates, we show the existence of an optimal policy, and the existence and uniqueness of the solution to the optimality equation within a class of possibly unbounded functions, for which the Feynman-Kac formula is also shown to hold.
literal at a time (2-1-1-2). This example aims to illustrate the benefits of SPMI over other update strategies and target choices. Further scenarios are reported in Appendix E.1. In Figure 1, we show the behavior of the different update strategies starting from a uniform initialization. We can see that both SPMI and SPMI-sup perform the policy updates and the model updates in sequence. This is a consequence of the fact that, looking only at the local advantage function, it is more convenient for the student to learn an almost optimal policy with no intervention on the teacher and then to refine the teacher model to gain further reward. The joint and adaptive strategy of SPMI outperforms both SPMI-sup and SPMI-alt. The alternated model-policy update (SPMI-alt) is not convenient since, with an initially poor-performing policy, updating the model does not yield a significant performance improvement. It is worth noting that all the methods converge in a finite number of steps and that the learning rates α and β exhibit an exponential growth trend.
A discrete-time continuous MDP is defined as a 6-tuple ⟨S, A, P, R, γ, μ⟩, where S is the continuous state space, A is the continuous action space, P is a Markovian transition model where P(s′ | s, a) defines the transition density between states s and s′ under action a, R : S × A → [−R, R] is the reward function, such that R(s, a) is the expected immediate reward for the state-action pair (s, a) and R is the maximum absolute reward value, γ ∈ [0, 1) is the discount factor for future rewards, and μ is the initial state distribution. We assume the state and action spaces to be complete, separable metric (Polish) spaces (S, d_S) and (A, d_A), equipped with their σ-algebras σ_S, σ_A of Borel sets, respectively. We assume, as done in Hinderer (2005), that the joint state-action space is endowed with the following taxicab norm: d_SA((s, a), (s′, a′)) = d_S(s, s′) + d_A(a, a′). A stationary policy π(·|s) specifies for each state s the density function over the Borel action space (A, d_A, σ_A).
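The joint distance d_SA defined above is simply the sum of the two component distances. A minimal sketch follows, assuming (purely for illustration) that states and actions are real vectors and that d_S and d_A are both Euclidean; the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed d_S and d_A)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def d_sa(pair1, pair2, d_s=euclidean, d_a=euclidean):
    """Taxicab combination: d_SA((s, a), (s', a')) = d_S(s, s') + d_A(a, a')."""
    (s1, a1), (s2, a2) = pair1, pair2
    return d_s(s1, s2) + d_a(a1, a2)


# Hypothetical 2-D state with 1-D action.
distance = d_sa(((0.0, 0.0), (0.0,)), ((3.0, 4.0), (2.0,)))
```

Any other pair of metrics on S and A can be passed through `d_s` and `d_a`; the sum of two metrics is again a metric on the product space.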
The summary of our results is presented in Table 1. A simple consequence of our results is that the Pareto curves can be approximated in pseudo-polynomial time in the case of the global variance, and in exponential time for the local variance.
1.2. Related work
Studying the trade-off between multiple objectives in an MDP has attracted significant attention in recent years (see  for an overview). In the formal verification area, MDPs with multiple mean-payoff objectives  , discounted objectives  , cumulative reward objectives  , and multiple ω-regular objectives  have been studied. As for the stability of a system, the variance-penalized mean-payoff problem (where the mean payoff is penalized by a constant times the variance) under memoryless (stationary) strategies was studied in  . The mean-payoff variance trade-off problem for unichain MDPs was considered in  , where a solution using quadratic programming was designed; under memoryless strategies the problem was considered in  . All the above works on the mean-payoff variance trade-off consider the global variance and are restricted to memoryless strategies. The problem for general strategies and global variance had not been solved before. Although restrictions to unichains or memoryless strategies are feasible in some areas, many systems modelled as MDPs might require a more general approach. For example, a decision of a strategy to shut the system down might make it impossible to return to the running state again, yielding a non-unichain MDP. Similarly, it is natural to synthesise strategies that change their decisions over time.
duration, see e.g., . The problem of finding a schedule for a fixed number of such (preemptable) jobs on a given set of identical machines such that the probability of meeting a given deadline is maximised is, in fact, an instance of timed reachability on CTMDPs. Optimal memoryless strategies exist for minimising the sum of the job completion times, but, as is shown, this is not the case for maximising the probability of reaching the deadline. The same applies to maximising the probability of completing all jobs within a fixed cost.
The following remark explains the novelty of the current work and its connection to the previous results and the known methods. As was mentioned (see also Section 5), the discounted cost is a special case of the considered model. Such a CTMDP was investigated in  where statements similar to Theorems 1 and 2 were proved. Generally speaking, we use the same method of attack, but all the proofs must be carefully rewritten because of the following: (a) the occupation measures can take infinite values; (b) Markov randomized strategies are not sufficient in optimization problems. The latter is confirmed by Example 2. To cover this gap, we introduce the new sufficient class of Poisson-related ξ-strategies.
DOI: 10.4236/ojs.2019.92014, Open Journal of Statistics
this paper is to find the policy with the minimal variance in the deterministic stationary policy class, which is different from the mean-variance criterion problem. The study of the mean-variance criterion problem is generally based on the discount criterion or the average criterion. In the literature on MDPs, many studies focus on the problem of expected reward optimization in the finite-stage case, the discounted MDP in the infinite-stage case, and the average reward problem in the infinite-stage case  . Typically, the optimality equation is established, the existence of an optimal policy is then proved, and finally a policy-iteration-type algorithm is used to solve the MDP problem. However, in real life, the optimal policies of this unconstrained optimization problem are often not unique, for example in queuing systems and network problems. So we introduce the variance to choose the optimal strategy. Variance is an important performance metric of stochastic systems. In financial engineering, we use the mean to measure the expected return, and the variance to measure the risk. The mean-variance problem of the portfolio can be traced back to Markowitz . Markowitz’s mean-variance portfolio problem has since been studied - . There, the decision maker’s expected reward is often assumed to be a constant, and the investor chooses a policy with a given expected return so as to minimize the risk; thus the Markowitz mean-variance portfolio model is a model of maximization of return and minimization of risk. However, given an expected return that may not be maximal, an optimal policy in Markowitz’s mean-variance portfolio problem may not be optimal in the usual sense of variance minimization problems for MDPs. Moreover, more and more real-life situations, such as queuing systems and networks, can be described as MDPs rather than stochastic differential equations, so Markowitz’s mean-variance portfolio problem should be extended to MDPs.
For the mean-variance problem of MDPs, as in   , we aim to obtain a variance-optimal policy over the set of policies for which the average reward or discounted reward is optimal, so that the variance criterion can be transformed into an equivalent average or discount criterion. However, when the mean criterion is not optimal, it is not clear how to develop a policy iteration algorithm to solve the problem. In discrete time, the discounted and long-run average variance criterion problems have been studied in  . These works mainly consider the variance optimization problem and do not constrain the mean. In continuous time, the variance of the average expected return has been defined for deterministic stationary policies. The finite-horizon expected reward is defined as below.
Another method of investigation is based on the study of the relation between the CTMDP problem and a DTMDP (discrete-time Markov decision process) problem. Once the CTMDP problem is reduced to an equivalent DTMDP problem, one can directly make use of the toolbox of the better developed theory of DTMDPs [2, 4, 5, 11, 12, 15, 29] for CTMDPs. This idea dates back at least to the 1970s; see Lippman , where the author applied the uniformization technique to reduce the CTMDP problem to a DTMDP problem; see also . However, the authors of [28, 36] not only required the transition rates to be uniformly bounded, but also restricted attention to the class of deterministic stationary policies, i.e., those that do not change actions between two consecutive state transitions.
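The uniformization step mentioned above can be sketched for the simplest setting. The following is an illustrative sketch under stated assumptions: a finite state space, a fixed deterministic stationary policy (so that the controlled process has a single rate matrix Q), and uniformly bounded rates. It builds the transition matrix P = I + Q/Λ of the associated discrete-time chain for a uniformization rate Λ no smaller than the largest exit rate. The function name and the example rates are hypothetical.

```python
def uniformize(rate_matrix, uniform_rate=None):
    """Return (P, uniform_rate) with P = I + Q / uniform_rate.

    rate_matrix: square list of lists Q with nonnegative off-diagonal rates
                 and rows summing to zero (conservative rates).
    uniform_rate: any value >= the largest exit rate -Q[i][i]; defaults to it.
    """
    n = len(rate_matrix)
    max_exit_rate = max(-rate_matrix[i][i] for i in range(n))
    if uniform_rate is None:
        uniform_rate = max_exit_rate
    if uniform_rate <= 0 or uniform_rate < max_exit_rate:
        raise ValueError("uniform_rate must be positive and bound all exit rates")
    transition = [
        [(1.0 if i == j else 0.0) + rate_matrix[i][j] / uniform_rate for j in range(n)]
        for i in range(n)
    ]
    return transition, uniform_rate


# Hypothetical two-state rate matrix under a fixed policy.
Q = [[-2.0, 2.0],
     [1.0, -1.0]]
P, rate = uniformize(Q)
```

The requirement that a finite Λ bound every exit rate is exactly where the uniform boundedness assumption enters; with unbounded rates no such Λ exists and this construction is unavailable.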