Abstract. Machine Learning algorithms are becoming more prevalent in critical systems where dynamic decision making and efficiency are the goal. As is the case for complex and safety-critical systems, where certain failures can lead to harm, we must proactively consider the safety assurance of such systems that use Machine Learning. In this paper we explore the implications of the use of Reinforcement Learning in particular, considering the potential benefits that it could bring to safety-critical systems, and our ability to provide assurances on the safety of systems incorporating such technology. We propose a high-level argument that could be used as the basis of a safety case for Reinforcement Learning systems, where the selection of ‘reward’ and ‘cost’ mechanisms would have a critical effect on the outcome of decisions made. We conclude with fundamental challenges that will need to be addressed to give the confidence necessary for deploying Reinforcement Learning within safety-critical applications.
Q-learning is a popular temporal-difference reinforcement learning algorithm which often explicitly stores state values using lookup tables. This implementation has been proven to converge to the optimal solution, but it is often beneficial to use a function-approximation system, such as deep neural networks, to estimate state values. It has been previously observed that Q-learning can be unstable when using value function approximation or when operating in a stochastic environment. This instability can adversely affect the algorithm’s ability to maximize its returns. In this paper, we present a new algorithm called Multi Q-learning to attempt to overcome the instability seen in Q-learning. We test our algorithm on a 4 × 4 grid-world with different stochastic reward functions using various deep neural networks and convolutional networks. Our results show that in most cases, Multi Q-learning outperforms Q-learning, achieving average returns up to 2.5 times higher than Q-learning and having a standard deviation of state values as low as 0.58.
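To make the idea concrete, here is a minimal tabular sketch (not the paper's implementation) of Q-learning with several independently updated Q-tables whose average drives action selection, in the spirit of the Multi Q-learning idea described above; the grid size, learning rate, and number of estimators are assumed values:

```python
import numpy as np

# Illustrative sketch: tabular Q-learning with several independently updated
# Q-tables whose mean drives epsilon-greedy action selection.
n_states, n_actions, n_estimators = 16, 4, 4   # 4x4 grid-world, 4 moves (assumed)
alpha, gamma, eps = 0.1, 0.95, 0.1
Q = np.zeros((n_estimators, n_states, n_actions))

def select_action(s, rng):
    """Epsilon-greedy on the mean of the Q estimates."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q.mean(axis=0)[s]))

def update(s, a, r, s_next, rng):
    """Update one randomly chosen estimator per step; action selection
    always uses the mean of all estimates."""
    k = rng.integers(n_estimators)
    td_target = r + gamma * np.max(Q[k, s_next])
    Q[k, s, a] += alpha * (td_target - Q[k, s, a])

rng = np.random.default_rng(0)
s = 0
a = select_action(s, rng)
update(s, a, r=-1.0, s_next=1, rng=rng)   # one illustrative transition
```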
In this paper, we proposed the use of mean feature embeddings as state representations to overcome two major problems in deep reinforcement learning for swarms: the high and possibly changing dimensionality of information perceived by each agent. We introduced three different approaches to realize such embeddings: two manually designed approaches based on histograms / radial basis functions and an end-to-end learned neural network feature representation. We evaluated the approaches on different variations of the rendezvous and pursuit evasion problems and compared their performance to that of a naive feature concatenation method and classical approaches found in the literature. Our evaluation revealed that learning embeddings end-to-end using neural network features scales well with increasing agent numbers, leads to better-performing policies, and often results in faster convergence compared to all other approaches. As expected, the naive concatenation approach fails for larger system sizes.
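As a rough illustration of the mean-embedding idea (not the authors' code), the sketch below averages a per-neighbour feature vector so that the policy input has a fixed dimension regardless of swarm size; the feature map phi is a toy stand-in for the histogram, RBF, or learned neural-network features:

```python
import numpy as np

def phi(neighbor_obs: np.ndarray) -> np.ndarray:
    """Toy per-neighbour feature map: relative position and its squared norm."""
    return np.concatenate([neighbor_obs, [np.dot(neighbor_obs, neighbor_obs)]])

def mean_embedding(neighbor_observations: list[np.ndarray]) -> np.ndarray:
    """Fixed-size state representation from a variable number of neighbours."""
    feats = np.stack([phi(o) for o in neighbor_observations])
    return feats.mean(axis=0)

# The representation has the same dimension for 3 or 300 neighbours:
state = mean_embedding([np.array([1.0, 0.5]),
                        np.array([-0.2, 2.0]),
                        np.array([0.0, -1.0])])
print(state.shape)   # (3,)
```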
In this work, we propose taking a data-driven approach to train a model that can conduct evaluation in learning for paraphrase generation. The framework contains two modules, a generator (for paraphrase generation) and an evaluator (for paraphrase evaluation). The generator is a Seq2Seq learning model with attention and copy mechanism (Bahdanau et al., 2015; See et al., 2017), which is first trained with cross entropy loss and then fine-tuned by using policy gradient with supervisions from the evaluator as rewards. The evaluator is a deep matching model, specifically a decomposable attention model (Parikh et al., 2016), which can be trained by supervised learning (SL) when both positive and negative examples are available as training data, or by inverse reinforcement learning (IRL) with outputs from the generator as supervisions when only positive examples are available. In the latter setting, for the training of the evaluator using IRL, we develop a novel algorithm based on the max-margin IRL principle (Ratliff et al., 2006). Moreover, the generator can be further trained with non-parallel data, which is particularly effective when the amount of parallel data is small.
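A minimal sketch of the max-margin principle underlying the IRL training of the evaluator, under the assumption that the evaluator's scores for a reference paraphrase and a generated paraphrase are available; the numbers below are placeholders, not outputs of the decomposable attention model:

```python
# Max-margin objective sketch: the evaluator's score f(x, y*) for a reference
# paraphrase should exceed its score f(x, y_hat) for a generator output by a margin.
def hinge_loss(score_reference: float, score_generated: float, margin: float = 1.0) -> float:
    """max(0, margin - f(x, y*) + f(x, y_hat)); zero once the margin is satisfied."""
    return max(0.0, margin - score_reference + score_generated)

# The reference currently scores 0.4 and the generated paraphrase 0.9, so the
# loss is positive and gradient updates would push the two scores apart.
print(hinge_loss(0.4, 0.9))   # 1.5
print(hinge_loss(2.0, 0.3))   # 0.0
```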
Rapid advances of hardware-based technologies during the past decades have opened up new possibilities for Life scientists to gather multimodal data in various application domains (e.g., Omics, Bioimaging, Medical Imaging, and [Brain/Body]-Machine Interfaces), thus generating novel opportunities for development of dedicated data intensive machine learning techniques. Overall, recent research in Deep learning (DL), Reinforcement learning (RL), and their combination (Deep RL) promise to revolutionize Artificial Intelligence. The growth in computational power accompanied by faster and increased data storage and declining computing costs have already allowed scientists in various fields to apply these techniques on datasets that were previously intractable for their size and complexity. This review article provides a comprehensive survey on the application of DL, RL, and Deep RL techniques in mining Biological data. In addition, we compare performances of DL techniques when applied to different datasets across various application domains. Finally, we outline open issues in this challenging research area and discuss future development perspectives.
In this paper we propose a simplification model which draws on insights from neural machine translation (Bahdanau et al., 2015; Sutskever et al., 2014). Central to this approach is an encoder-decoder architecture implemented by recurrent neural networks. The encoder reads the source sequence into a list of continuous-space representations from which the decoder generates the target sequence. Although our model uses the encoder-decoder architecture as its backbone, it must also meet constraints imposed by the simplification task itself, i.e., the predicted output must be simpler, preserve the meaning of the input, and be grammatical. To incorporate this knowledge, the model is trained in a reinforcement learning framework (Williams, 1992): it explores the space of possible simplifications while learning to maximize an expected reward function that encourages outputs which meet simplification-specific constraints. Reinforcement learning has been previously applied to extractive summarization (Ryang and Abekawa, 2012), information extraction (Narasimhan et al., 2016), dialogue generation (Li et al., 2016), machine translation, and image caption generation (Ranzato et al., 2016).
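For concreteness, the snippet below is a self-contained toy REINFORCE (Williams, 1992) update of the kind described above; the five discrete "outputs" and the reward function are stand-ins for the encoder-decoder and the simplification-specific reward, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(5)                # toy stochastic policy over 5 candidate outputs
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def toy_reward(a: int) -> float:
    # Hypothetical reward: pretend output 2 best satisfies the simplicity,
    # meaning-preservation, and grammaticality constraints.
    return 1.0 if a == 2 else 0.1

for _ in range(500):
    p = softmax(logits)
    a = rng.choice(len(p), p=p)                 # explore the space of outputs
    grad_log_p = -p
    grad_log_p[a] += 1.0                        # d log pi(a) / d logits
    logits += lr * toy_reward(a) * grad_log_p   # policy-gradient ascent step

print(np.round(softmax(logits), 3))             # probability mass concentrates on output 2
```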
The basic idea of reproducing-kernel methods is to apply the “kernel trick” in the context of reinforcement learning (Schölkopf and Smola, 2002). Roughly speaking, the approximation problem is rewritten in terms of inner products only, which are then replaced by a properly-defined kernel. This modification corresponds to mapping the problem to a high-dimensional feature space, resulting in more expressiveness of the function approximator. Perhaps the most natural way of applying the kernel trick in the context of reinforcement learning is to “kernelize” some formulation of the value-function approximation problem (Xu et al., 2005; Engel et al., 2005; Farahmand, 2011). Another alternative is to approximate the dynamics of an MDP using a kernel-based regression method (Rasmussen and Kuss, 2004; Taylor and Parr, 2009). Following a slightly different line of work, Bhat et al. (2012) propose to kernelize the linear programming formulation of dynamic programming. However, this method is not directly applicable to reinforcement learning, since it is based on the assumption that one has full knowledge of the MDP. A weaker assumption is to suppose that only the reward function is known and focus on the approximation of the transition function. This is the approach taken by Grunewalder et al. (2012), who propose to embed the conditional distributions defining the transitions of an MDP into a Hilbert space induced by a reproducing kernel.
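As a generic illustration of the kernel trick for value-function approximation (not a reproduction of any of the cited methods), the sketch below fits V(s) = Σ_i α_i k(s_i, s) by kernel ridge regression on sampled states and noisy return targets:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=0.5):
    """Gaussian (RBF) kernel matrix between two sets of states."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(50, 2))                            # sampled 2-D states
returns = np.sin(3 * states[:, 0]) + 0.1 * rng.standard_normal(50)   # noisy value targets

K = rbf_kernel(states, states)
alpha = np.linalg.solve(K + 1e-2 * np.eye(len(K)), returns)          # ridge-regularised fit

def value(s):
    """Evaluate the kernel expansion V(s) = sum_i alpha_i k(s_i, s) at a new state."""
    return rbf_kernel(np.atleast_2d(s), states) @ alpha

print(float(value(np.array([0.3, -0.4]))))
```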
To achieve these goals, we draw on the insights of reinforcement learning, which have been widely applied in MDP and POMDP dialogue systems (see the Related Work section for details). We introduce a neural reinforcement learning (RL) generation method, which can optimize long-term rewards designed by system developers. Our model uses the encoder-decoder architecture as its backbone, and simulates conversation between two virtual agents to explore the space of possible actions while learning to maximize expected reward. We define simple heuristic approximations to rewards that characterize good conversations: good conversations are forward-looking (Allwood et al., 1992) or interactive (a turn suggests a following turn), informative, and coherent. The parameters of an encoder-decoder RNN define a policy over an infinite action space consisting of all possible utterances.
Effective diffusion of knowledge has been studied in many fields, including inverse reinforcement learning (Ng and Russell 2000), apprenticeship learning (Abbeel and Ng 2004), and learning from demonstration (Argall et al. 2009), wherein students discern and emulate key demonstrated behaviors. Works on curriculum learning (Bengio et al. 2009) are also related, particularly automated curriculum learning (Graves et al. 2017). Though Graves et al. focus on single-student supervised/unsupervised learning, they highlight interesting measures of learning progress also used here. Several works meta-learn active learning policies for supervised learning (Bachman, Sordoni, and Trischler 2017; Fang, Li, and Cohn 2017; Pang, Dong, and Hospedales 2018; Fan et al. 2018). Our work also uses advising-level meta-learning, but in the regime of MARL, where agents must learn to advise teammates without destabilizing coordination. In action advising, a student executes actions suggested by a teacher, who is typically an expert always advising the optimal action (Torrey and Taylor 2013). These works typically use the state importance value I(s, â) = max_a Q(s, a) − Q(s, â).
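A small helper illustrating this importance heuristic on a toy Q-table (not the authors' code) is shown below:

```python
import numpy as np

def state_importance(Q_row: np.ndarray, a_hat: int) -> float:
    """I(s, a_hat) = max_a Q(s, a) - Q(s, a_hat): the teacher's advice matters
    most where the gap between the best action and the intended action is large."""
    return float(Q_row.max() - Q_row[a_hat])

Q_s = np.array([1.0, 0.2, 0.7])          # teacher's Q-values in some state s (toy values)
print(state_importance(Q_s, a_hat=1))    # 0.8: advising here matters a lot
print(state_importance(Q_s, a_hat=0))    # 0.0: the student already chooses optimally
```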
To better engage with RL-based generative art, the dissertation creates RL5, a JavaScript library built on top of p5.js to improve the accessibility of reinforcement learning for creatives. RL5 allows developers to define their own RL environments in native p5.js language and train RL policies in web browsers. RL5 provides three RL algorithms to cover the four possible combinations of different types of state and action spaces, and nine RL environments to serve as building blocks for constructing complex systems. With a focus on simplicity and (re)usability, the APIs of RL5 enable users to create, train, and evaluate an RL agent in fewer than 20 lines of code. The library is demonstrated in an RL environment called Avoid An Obstacle, in which the goal is to train an agent to move on a 2D rectangular area from left to right without hitting a rectangle in the middle. With the same training settings but different random seeds, the agent develops different strategies to accomplish the task.
Collaborative Reinforcement Learning (CRL) is a bottom-up approach to tackling the complex, time-varying problems of engineering autonomic behaviour for distributed systems where there is no support for global state. It is an extension to Reinforcement Learning (RL) [2] for solving system-wide optimisation problems in decentralised multi-agent systems. In CRL, individual agents solve discrete optimisation problems using RL and share solution information with their neighbours, contributing towards the solution of the system-wide optimisation problem. Agents are part of a dynamic population, with support for agents joining and leaving the system and establishing connections with neighbours. CRL does not make use of system-wide knowledge, and individual agents only know about and interact with their neighbours.
Deep reinforcement learning (DRL) has achieved significant breakthroughs in various tasks. However, most DRL algorithms struggle to generalise the learned policy, so that policy performance is strongly affected even by minor modifications of the training environment. In addition, the use of deep neural networks makes the learned policies hard to interpret. To address these two challenges, we propose a novel algorithm named Neural Logic Reinforcement Learning (NLRL) to represent the policies in reinforcement learning by first-order logic. NLRL is based on policy gradient methods and differentiable inductive logic programming, which have demonstrated significant advantages in terms of interpretability and generalisability in supervised tasks. Extensive experiments conducted on cliff-walking and blocks manipulation tasks demonstrate that NLRL can induce interpretable policies achieving near-optimal performance while showing good generalisability to environments of different initial states and problem sizes.
Temporal difference methods are theoretically grounded and empirically effective methods for addressing reinforcement learning problems. In most real-world reinforcement learning tasks, TD methods require a function approximator to represent the value function. However, using function approximators requires manually making crucial representational decisions. This paper investigates evolutionary function approximation, a novel approach to automatically selecting function approximator representations that enable efficient individual learning. This method evolves individuals that are better able to learn. We present a fully implemented instantiation of evolutionary function approximation which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. The resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators. This paper also presents on-line evolutionary computation, which improves the on-line performance of evolutionary computation by borrowing selection mechanisms used in TD methods to choose individual actions and using them in evolutionary computation to select policies for evaluation. We evaluate these contributions with extended empirical studies in two domains: 1) the mountain car task, a standard reinforcement learning benchmark on which neural network function approximators have previously performed poorly, and 2) server job scheduling, a large probabilistic domain drawn from the field of autonomic computing. The results demonstrate that evolutionary function approximation can significantly improve the performance of TD methods, and on-line evolutionary computation can significantly improve evolutionary methods. This paper also presents additional tests that offer insight into what factors can make neural network function approximation difficult in practice.
Lyapunov design methods are used widely in control engineering to design controllers that achieve qualitative objectives, such as stabilizing a system or maintaining a system’s state in a desired operating range. We propose a method for constructing safe, reliable reinforcement learning agents based on Lyapunov design principles. In our approach, an agent learns to control a system by switching among a number of given, base-level controllers. These controllers are designed using Lyapunov domain knowledge so that any switching policy is safe and enjoys basic performance guarantees. Our approach thus ensures qualitatively satisfactory agent behavior for virtually any reinforcement learning algorithm and at all times, including while the agent is learning and taking exploratory actions. We demonstrate the process of designing safe agents for four different control problems. In simulation experiments, we find that our theoretically motivated designs also enjoy a number of practical benefits, including reasonable performance initially and throughout learning, and accelerated learning.
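A hedged sketch of the switching scheme described above: the learner's actions are choices among base-level controllers that are assumed, by Lyapunov design outside this snippet, to be individually safe, so any exploratory switching policy stays within the safe set; the controllers, dynamics, and reward below are placeholders, not any of the paper's control problems:

```python
import numpy as np

controllers = [
    lambda x: -0.5 * x,        # e.g. a gentle stabilising controller
    lambda x: -1.5 * x,        # e.g. a more aggressive stabilising controller
]

rng = np.random.default_rng(0)
n_bins, gamma, alpha, eps = 10, 0.95, 0.2, 0.1
Q = np.zeros((n_bins, len(controllers)))   # Q-values over (state bin, controller choice)

def discretise(x, lo=-2.0, hi=2.0):
    return int(np.clip((x - lo) / (hi - lo) * n_bins, 0, n_bins - 1))

x = 1.5                                    # toy scalar system state
for _ in range(2000):
    s = discretise(x)
    a = rng.integers(len(controllers)) if rng.random() < eps else int(np.argmax(Q[s]))
    u = controllers[a](x)                  # whichever controller is picked, it is safe by design
    x_next = x + 0.1 * u + 0.01 * rng.standard_normal()
    r = -abs(x_next)                       # reward: drive the state toward the origin
    Q[s, a] += alpha * (r + gamma * Q[discretise(x_next)].max() - Q[s, a])
    x = x_next
```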
Our proposal for active Bayesian perception and reinforcement learning is tested with a simple but illustrative task of perceiving object curvature using tapping movements of a biomimetic fingertip with unknown contact location (Fig. 1). We demonstrate first that active perception with a fixation point control strategy can give robust and accurate perception, but the reaction time and acuity depend strongly on the choice of fixation point and belief threshold. Next, we introduce a reward function of the decision outcome, which for illustration is taken as a linear Bayes risk of reaction time and error. Interpreting each active perception strategy (parameterized by the decision threshold and fixation point) as an action then allows use of standard reinforcement learning methods for multi-armed bandits [20]. In consequence, the appropriate decision threshold is learnt to balance the risk of making mistakes versus the risk of reacting too slowly, while the fixation point is tuned to optimize both quantities.
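The bandit formulation can be sketched as follows; the simulator of reaction time and error is a toy stand-in for the tactile experiment, and the cost weights and candidate strategies are assumed values:

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = [0.6, 0.8, 0.95]
fixation_points = [0.0, 0.5, 1.0]
arms = [(t, f) for t in thresholds for f in fixation_points]   # each arm = one perception strategy

def simulate_trial(threshold, fixation):
    """Toy model: higher thresholds take longer but err less; fixation near 0.5 helps."""
    reaction_time = 2.0 * threshold + rng.exponential(0.2)
    p_error = (1.0 - threshold) * (0.5 + abs(fixation - 0.5))
    error = rng.random() < p_error
    return reaction_time, error

c_time, c_error, eps = 0.1, 1.0, 0.1
value, counts = np.zeros(len(arms)), np.zeros(len(arms))
for _ in range(5000):
    a = rng.integers(len(arms)) if rng.random() < eps else int(np.argmax(value))
    rt, err = simulate_trial(*arms[a])
    reward = -(c_time * rt + c_error * err)          # negative linear Bayes risk
    counts[a] += 1
    value[a] += (reward - value[a]) / counts[a]      # incremental mean estimate per arm

print("best strategy (threshold, fixation):", arms[int(np.argmax(value))])
```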
One way to interpret the individual experts in the product model is that they are learning “macro” or “basis” actions. As we have seen with the Blockers task, the hidden variables come to represent sets of actions that are spatially and temporally localized. We can think of the hidden variables as representing “basis” actions that can be combined to form a wide array of possible actions. The benefit of having basis actions is that it reduces the number of possible actions, thus making exploration more efficient. The drawback is that if the set of basis actions does not span the space of all possible actions, some actions become impossible to execute. By optimizing the set of basis actions during reinforcement learning, we find a set that can form useful actions, while excluding action combinations that are either not seen or not useful.
A DPP defines a probability distribution over the subsets of a ground set. The probability of a subset is proportional to the determinant of a principal submatrix of a positive semidefinite matrix, where the submatrix is indexed by the items in the subset. A DPP thus assigns high probability to those subsets that have relevant and diverse items. DPPs have been used in machine learning applications, including recommendation of products (Gillenwater et al. 2014; Gartrell, Paquet, and Koenigstein 2017), summarization of documents or videos (Gong et al. 2014), hyper-parameter optimization (Kathuria, Deshpande, and Kohli 2016), and mini-batch sampling (Zhang, Kjellström, and Mandt 2017). DPPs have also been used for modeling neural spiking to better represent the negative correlation between neurons (Snoek, Zemel, and Adams 2013). DPPs, however, have never been used in reinforcement learning. We will see that a DPP naturally appears with Determinantal SARSA when we choose actions according to the standard approach of Boltzmann exploration.
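The defining determinant computation is easy to illustrate; the item features below are toy values unrelated to Determinantal SARSA itself:

```python
import numpy as np

# Unnormalised DPP probability of a subset S: det(L_S), the determinant of the
# principal submatrix of a positive semidefinite kernel L indexed by S.
B = np.array([[1.0, 0.0],      # item 0
              [0.0, 1.0],      # item 1: orthogonal to item 0 (diverse)
              [0.9, 0.1]])     # item 2: nearly parallel to item 0 (redundant)
L = B @ B.T                    # positive semidefinite ground-set kernel

def dpp_unnormalised_prob(L: np.ndarray, subset: list[int]) -> float:
    """Determinant of the principal submatrix indexed by `subset`."""
    return float(np.linalg.det(L[np.ix_(subset, subset)]))

print(dpp_unnormalised_prob(L, [0, 1]))   # 1.00: diverse pair, high probability
print(dpp_unnormalised_prob(L, [0, 2]))   # 0.01: redundant pair, low probability
```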
Reinforcement learning (RL) is a machine learning framework for solving sequential decision-making problems. Despite its successes in a number of different domains, including backgammon (Tesauro, 1994), job-shop scheduling (Zhang and Dietterich, 1995), dynamic channel allocation (Singh and Bertsekas, 1996), elevator scheduling (Crites and Barto, 1998), and helicopter flight control (Ng et al., 2004), current RL methods do not scale well to high-dimensional domains—they can be slow to converge and require many training samples to be practical for many real-world problems. This issue is known as the curse of dimensionality: the exponential growth of the number of parameters to be learned with the size of any compact encoding of system state (Bellman, 1957). Recent attempts to combat the curse of dimensionality have turned to principled ways of exploiting abstraction in RL. This leads naturally to hierarchical control architectures and associated learning algorithms.
On-line reinforcement learning agents are difficult to train. Training takes a long time because the agent has no direct answer for the input at hand; it has to rely on its own assessment of how good or bad the last action was (in the long run) in order to achieve a goal. For a real-world agent the difficulty is compounded by partial observability and variability of experience, as well as by the infeasibility of deliberately repeating that experience. Experiences vary in both their states and their actions. Consequently, even when the agent starts from the same position and takes the same action, the outcome will vary slightly due to the continuum of possible states at any location and the inaccuracy of the actions taken. This is especially true for agents with loose mechanics, such as those encountered in games and DIY robots. Moreover, it is physically difficult or undesirable to let the agent run through many episodes. Therefore, the agent needs to maximize the advantage of available experience with the least amount of time and repetition; hence offline reflection on past experience can play an important role in mitigating these difficulties.
Reinforcement learning (RL, Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996) provides a framework to autonomously learn control policies in stochastic environments and has become popular in recent years for controlling robots (e.g., Abbeel et al., 2007; Kober and Peters, 2009). The goal of RL is to compute a policy which selects actions that maximize the expected future reward (called value). An agent has to make these decisions based on the state x ∈ X of the system. The state space X may be finite or continuous, but is in many practical cases too large to be represented directly. Approximated RL addresses this by choosing a function from a function set F that resembles the true value function. Many function sets F have been proposed (see, e.g., Sutton and Barto, 1998; Kaelbling et al., 1996, for an overview). This article will focus on the space of linear functions with p non-linear basis functions {φ_i(·)}_{i=1}^p (Bertsekas, 2007), which we call the approximation space F^φ.
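A minimal illustration of such an approximation space F^φ, with arbitrary radial-basis features and weights rather than anything taken from the article:

```python
import numpy as np

centers = np.linspace(-1.0, 1.0, 5)            # p = 5 RBF centres over a 1-D state space

def phi(x: float) -> np.ndarray:
    """Non-linear basis functions phi_1(x), ..., phi_p(x)."""
    return np.exp(-0.5 * ((x - centers) / 0.4) ** 2)

def value(x: float, w: np.ndarray) -> float:
    """Linear-in-the-weights approximation V(x) = sum_i w_i * phi_i(x)."""
    return float(w @ phi(x))

w = np.array([0.2, 0.5, 1.0, 0.5, 0.2])        # example weight vector
print(value(0.1, w))
```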