Generalized hybrid factored MDPs

Discrete-state factored MDPs (Boutilier, Dearden, & Goldszmidt 1995) permit a compact representation of stochastic decision problems by exploiting their structure. In this section, we introduce a new formalism for representing hybrid factored MDPs with an exponential-family transition model. This formalism is based on the HMDP framework (Guestrin, Hauskrecht, & Kveton 2004) and generalizes its mixture-of-beta transition model for continuous variables.
To overcome the limitations of the discussed constraint satisfaction techniques, we propose a novel Markov chain Monte Carlo (MCMC) method for finding the most violated constraint of a relaxed HALP. The method operates directly in the domains of the continuous variables, takes into account the structure of factored MDPs, and has space complexity proportional to the number of variables. Such a separation oracle can be easily embedded into the ellipsoid or cutting-plane method for solving linear programs, and therefore constitutes a key step towards solving HALP efficiently.
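The separation-oracle idea above can be sketched as a Metropolis-style search over states, where the chain is biased towards states with larger constraint violation and the best visited state approximates the most violated constraint. Everything below (the violation function, the proposal, the temperature) is an illustrative toy, not the paper's actual sampler.

```python
import math
import random

def mcmc_most_violated(violation, init_state, proposal, steps=2000, temp=0.1):
    """Metropolis-style search for the state whose constraint is most
    violated. `violation(x)` returns the violation magnitude at state x;
    uphill moves are always accepted, downhill moves only with
    probability exp(delta / temp). Names here are illustrative."""
    x = init_state
    best, best_v = x, violation(x)
    for _ in range(steps):
        y = proposal(x)
        delta = violation(y) - violation(x)
        if delta >= 0 or random.random() < math.exp(delta / temp):
            x = y
        if violation(x) > best_v:
            best, best_v = x, violation(x)
    return best, best_v

# Toy run on a 1-D integer state space {0, ..., 10}.
random.seed(0)
violation = lambda x: 25.0 - (x - 5) ** 2          # peak violation at x = 5
proposal = lambda x: min(10, max(0, x + random.choice((-1, 1))))
state, value = mcmc_most_violated(violation, 0, proposal)
```

In a real HALP solver the violation would be the magnitude by which a state's constraint is violated under the current relaxed-LP solution, and the proposal would perturb one state variable at a time so that each step touches only a few factors.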
The main drawback of linear models is the need for a good basis set. While these approaches may scale, the quality of the approximation depends critically on the underlying basis: if no decent approximate value function lies in the subspace spanned by the basis, such techniques cannot obtain good solutions. Unfortunately, the recent work on linear approximations for factored MDPs offers no proposals for either (a) the choice of a good basis or (b) the modification of an existing basis to improve decision quality. Studies to date have used simple characteristic functions over (very small) subsets of state variables.
Structured probabilistic models, and particularly Bayesian networks, have revolutionized the field of reasoning under uncertainty by allowing compact representations of complex domains. Their success is built on the fact that this structure can be exploited effectively by inference and learning algorithms, and it leads one to hope that similar structure can be exploited in the context of planning and reinforcement learning under uncertainty. This paper, together with the recent work on representing and reasoning with factored MDPs [Boutilier et al., 1999], demonstrates that substantial computational gains can indeed be obtained from these compact, structured representations.
In this paper we proposed a new state-space associative metric for factored MDPs that draws inspiration from classical conditioning in nature. Our metric relies on associations between state variables identified by the learning agent during its interaction with the environment. These associations are learned using a sensory pattern-mining algorithm and determine the similarity between states, thus providing a state-space metric that requires no prior knowledge of the structure of the underlying decision problem. The sensory pattern-mining algorithm relies on the associative sensory tree, which captures the frequency of co-occurrence of stimuli in the agent's environment.
We present the first framework that exploits problem structure and solves large hybrid MDPs efficiently. The MDPs are modelled as hybrid factored MDPs, where the stochastic dynamics are represented compactly by a probabilistic graphical model, a hybrid dynamic Bayesian network (DBN) (Dean & Kanazawa 1989). The solution of the MDP is approximated by a linear combination of basis functions (Bellman, Kalaba, & Kotkin 1963; Bertsekas & Tsitsiklis 1996). Specifically, we use a factored (linear) value function (Koller & Parr 1999), where each basis function depends on a small number of state variables. We show that the weights of this approximation can be optimized using a convex formulation that we call hybrid ALP (HALP).
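The factored (linear) value function mentioned here can be illustrated with a minimal sketch: the global value is a weighted sum of basis functions, each of which reads only a small scope of state variables. The particular bases and weights below are invented for illustration.

```python
def factored_value(state, bases, weights):
    """Evaluate a factored linear value function.
    state:   dict mapping variable name -> value
    bases:   list of (scope, fn) pairs, where each fn reads only the
             variables in its small scope
    weights: one weight per basis function
    """
    total = 0.0
    for (scope, fn), w in zip(bases, weights):
        # Each basis function sees only its own small scope of variables.
        total += w * fn(tuple(state[v] for v in scope))
    return total

# Example: two single-variable bases plus one pairwise basis.
bases = [
    (("x1",), lambda v: 1.0 if v[0] == 1 else 0.0),   # indicator on x1
    (("x2",), lambda v: float(v[0])),                  # linear in x2
    (("x1", "x2"), lambda v: float(v[0] * v[1])),      # product term
]
weights = [2.0, 0.5, 1.0]
v = factored_value({"x1": 1, "x2": 3}, bases, weights)  # 2.0 + 1.5 + 3.0
```

Because each basis touches few variables, expectations of such a value function under a DBN transition model decompose into small local computations, which is what the convex formulation exploits.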
Markov decision processes (MDPs) have become the de facto standard model for decision-theoretic planning problems, and a great deal of research in recent years has aimed to exploit structure in order to compactly represent and efficiently solve factored MDPs [2–5]. However, in many real-world problems, it is simply impossible to obtain a precise representation of the transition probabilities in an MDP. This may occur for many reasons, including (a) imprecise or conflicting elicitations from experts, (b) insufficient data from which to estimate reliable precise transition models, or (c) non-stationary transition probabilities due to insufficient state information.
Markov decision processes, or MDPs, are widely used to model stochastic control tasks. Many researchers have developed algorithms that determine optimal or near-optimal decision policies for MDPs. However, most of these algorithms scale poorly as the size of a task grows. Much recent research on MDPs has focused on finding task structure that makes it possible to simplify construction of a useful policy. In this paper, we present Variable Influence Structure Analysis, or VISA, an algorithm that identifies task structure in factored MDPs and combines hierarchical decomposition and state abstraction to exploit task structure and simplify policy construction. VISA was first introduced in a conference paper (Jonsson and Barto, 2005); this paper provides more detail and additional insights as well as a new section on compact activity models.
To our knowledge, there have been no previous attempts to handle identification of dead ends in MDPs. The “Sensitive but Slow” and “Fast but Insensitive” mechanisms were not actually designed specifically for the purpose of identifying dead ends and are unsatisfactory in many ways. One possible reason for this omission may be that most MDPs studied by the Artificial Intelligence and Operations Research communities until recently had no dead ends. However, MDPs with dead ends have been receiving attention in the past few years as researchers realized their probabilistic interestingness. Besides the analogy to EBL, SixthSense can also be viewed as a machine learning algorithm for rule induction, similar in purpose, for example, to CN2. While this analogy is valid, SixthSense operates under different requirements than most such algorithms, because we demand that SixthSense-derived rules (nogoods) have a zero false-positive rate. Last but not least, our term “nogood” shares its name with and closely mirrors the concept from the areas of truth maintenance systems (TMSs) and constraint satisfaction problems (CSPs). However, our methodology for finding nogoods has little in common with algorithms used in that literature.
Dietterich and Flann [32,33] also consider the application of regression methods to the solution of MDPs in the context of reinforcement learning. Their original proposal is restricted to MDPs with goal regions and deterministic actions (represented using STRIPS-like operators), thus rendering true goal-regression techniques directly applicable. They later extend their approach to allow stochastic actions, thus providing a stochastic generalization of goal regression. One key difference between their model and ours is that they deal exclusively with goal-based problems whereas we allow general reward functions. Thus we might classify their work as stochastic regression and ours as decision-theoretic regression. The general motivation and spirit of their proposal is very similar to ours, but focuses on different representations. In the abstract, Dietterich and Flann simply require operators (actions) that can be inverted, and they develop grid-world navigation and chess end-games as examples of deterministic regression. In the stochastic case, Dietterich and Flann place an emphasis on algorithms for manipulating rectangular regions of grid worlds. In contrast, our approach deals with general DBN/decision-tree representations of discrete, multi-variable systems. Our decision-tree representation has certain advantages in multi-variable domains (e.g., we will see below that it provides leverage for approximation). In navigation domains (to take one example), the region-based representation is clearly superior, since such domains offer very little structure that can be exploited by a decision tree. Both approaches can be seen as particular instances of a more general approach to regression in MDPs.
In this evaluation, we use the TCT Treebank as the development and experimental data. The Treebank uses an annotation scheme with double tagging (Zhou, 2004). Under this scheme, every sentence is annotated with a complete parse tree, where each non-terminal constituent is assigned two tags, a syntactic constituent tag and a grammatical relation tag; this is a new annotation scheme that differs from the head-constituent annotation in previous TCT versions. To fit this annotation of TCT, we use the unlexicalized model for PCFG parsing and the CKY-based decoder in the Stanford parser. Finally, we use Tregex (Levy, 2006), a useful tool for visualizing and querying syntactic structures, to generate a head propagation table that is applied to the factored model in order to improve performance.
Our contributions are mainly in two directions: a better analysis pipeline for Bulgarian, and different, more complex types of factored models to explore successful factor combinations. We have experimented with a number of combinations of the listed factors, language model types (word and POS), and translation and generation steps. The best-performing model featuring a semantic factor for the direction BG→EN includes four factors (word form, lemma, POS, and variable type) and a word- and POS-based language model. In the transfer step, two alternative approaches are used. If possible a mapping
is satisfied along paths of the IMDP M starting in state s and following policy σ. Regarding the definitions, IMDPs may be seen as an extension of MDPs with an infinite (even uncountable) set of actions, without taking into account the randomisation in policies. This makes their study a priori more complex. However, one of the contributions regarding IMDPs is to show that their behaviour can be captured by finite MDPs. We now explain this reduction, which we will use for proofs but not for algorithms, since it constructs a finite MDP with a number of actions exponentially larger than in the original IMDP. The main idea is to make explicit the set of possible choices of probability distributions in Steps(a) for a given action a ∈ A(s). Recall that it consists of all distributions p ∈ Dist(S) such that P
took 3.31 s for IO-Q-MMDP and 2696.23 s for IO-Q-Dec-POMDP. Fig. 5 shows the results, which indicate that the upper bound is relatively tight: the solutions found by TP are not far from the upper bound. In particular, the EAF typically lies between 1.4 and 1.7, thus demonstrating that IO-UBs can provide firm guarantees for solutions of factored Dec-POMDPs with up to 700 agents. Moreover, we see that the EAF stays roughly constant for the larger problem instances, indicating that the relative guarantees do not degrade as the number of agents increases.
Modal logic represents knowledge that agents have about other agents’ knowledge. Probabilistic modal logic further captures probabilistic beliefs about probabilistic beliefs. Models in these logics are useful for understanding and decision making in conversations, bargaining situations, and competitions. Unfortunately, probabilistic modal structures are impractical for large real-world applications because they represent their state space explicitly. In this paper we scale up probabilistic modal structures by giving them a factored representation. This representation applies conditional independence for factoring the probabilistic aspect of the structure (as in Bayesian networks (BNs)). We also present two exact algorithms and one approximate algorithm for reasoning about the truth value of probabilistic modal logic queries over a model encoded in a factored form. The first exact algorithm applies inference in BNs to answer a limited class of queries. Our second exact method applies a variable elimination scheme and is applicable without restrictions. Our approximate algorithm uses sampling and can be used for applications with very large models. Given a query, it computes an answer and its confidence level efficiently.
Based on the hypothesis that the factorisations are beneficial when translating some sentences, and not when translating others, we completed an oracle-based evaluation, in which we assume to know a priori whether to use the factored model for translating a given sentence, or just go with the baseline, unfactored model. In reality, we don’t have such an oracle method for arbitrary sentences, but when dealing with the shared task test set (or other corpora for which we have reference translations), it was easy enough to check per-sentence BLEU scores for each model and make the decision based on a comparison. Table 1b lists BLEU scores obtainable with each factor configuration given such an oracle method. In this scenario, most factored models beat the baseline, indicating that the factorisations are beneficial for certain sentences and detrimental for others.
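The per-sentence oracle described above can be sketched as a simple selection loop. The toy overlap score below stands in for a sentence-level BLEU; all names and the example data are illustrative, not taken from the shared task.

```python
def overlap(hyp, ref):
    """Toy sentence-level score: fraction of reference tokens that also
    appear in the hypothesis (a stand-in for smoothed sentence BLEU)."""
    ref_tokens = ref.split()
    hyp_tokens = set(hyp.split())
    return sum(1 for t in ref_tokens if t in hyp_tokens) / len(ref_tokens)

def oracle_select(hyps_factored, hyps_baseline, refs, score=overlap):
    """Per-sentence oracle: for each sentence, keep whichever system's
    hypothesis scores higher against the reference."""
    return [a if score(a, r) >= score(b, r) else b
            for a, b, r in zip(hyps_factored, hyps_baseline, refs)]

factored = ["the cat sat", "a dogs runs"]
baseline = ["cat the sit", "a dog runs"]
refs     = ["the cat sat", "a dog runs"]
chosen = oracle_select(factored, baseline, refs)
```

Scoring `chosen` with a corpus-level metric gives the oracle numbers: an upper bound on what a perfect per-sentence model selector could achieve.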
The idea of divide and conquer through domain decomposition has always appealed to planning researchers. In this paper we provided a formal study of some of the fundamental questions factored planning brings up. This study resulted in a number of key results and insights. First, it provides a novel factored planning approach that is more efficient than the best previous method of (Amir & Engelhardt 2003). Second, it identifies the domain’s causal graph as one of the key parameters in the complexity of factored and non-factored planning. Third, the complexity analysis provided enables us to compare the complexity of standard and factored methods, and provides new classes of tractable planning problems. As we noted, these tractable classes appear to be of genuine practical interest, which has not often been the case for past results on tractable planning. Finally, our analysis helps to understand what makes one factorization better than another, and makes a concrete recommendation on how to factor a problem domain both in the presence and in the absence of additional domain knowledge.
First, we implemented the direct policy algorithm of Peshkin et al. (2000). This algorithm is designed to learn policies for MDPs on factored state and action spaces. To parameterize the policy, we used a feed-forward neural network with one hidden layer. The number of hidden units was chosen to match the number of hidden variables in the competing restricted Boltzmann machine. The output layer of the neural network consisted of a softmax unit for each action variable, which gave the probability of executing each value for that action variable. For example, if an action variable has four possible values, then there are separate inputs (weights and activations) entering the output unit for each of the four possible values. The output unit produces four normalized probabilities by first exponentiating the values and then normalizing by the sum of the four exponentiated values.
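The per-action-variable softmax output described here (exponentiate, then normalize by the sum) can be sketched as follows; the shapes and names are illustrative, not taken from Peshkin et al.

```python
import math

def softmax(logits):
    """Exponentiate and normalize a list of logits into probabilities."""
    m = max(logits)                         # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_distribution(logits_per_variable):
    """One independent softmax unit per action variable: each variable's
    logits are normalized separately, giving one distribution over that
    variable's possible values."""
    return [softmax(z) for z in logits_per_variable]

# A single action variable with four possible values, as in the example:
# four separate inputs feed the unit, which emits four probabilities.
probs = action_distribution([[0.0, 1.0, 2.0, 3.0]])
```

Sampling one value per variable from its distribution then yields a complete factored action.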