Future Research - Bayesian Optimisation for Planning And Reinforcement Learning

EMU-Q is experimentally shown to outperform other exploration techniques

EMU-Q was evaluated on a large benchmark of reinforcement learning tasks and on a realistic simulated robotic reaching task. The method compared favourably against classic exploration techniques and more advanced methods such as intrinsic RL with additive rewards.

6.2 Future Research

Although experimentally successful, the algorithms presented in this thesis could be improved in several ways. Some avenues worth investigating include:

Mapping actions to returns in CBTS

Mappings from actions to rewards are currently learned at a node level in CBTS. Ideally, one would map actions to their returns (or Monte Carlo estimates of returns) so that the CBTS branch selection metric is completely non-myopic. Learning a mapping from actions to rewards was chosen for ease of implementation, as a GP model with homoscedastic noise suffices under mild assumptions. However, the distribution of returns given actions often has non-constant variance and, in more complex cases, is not necessarily Gaussian. Learning such a mapping could be addressed using a GP with heteroscedastic noise in the first case, and more advanced probabilistic models in the second.
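As a rough illustration of why returns call for a heteroscedastic model, the sketch below computes Monte Carlo return estimates for rollouts grouped by action and reports the per-action mean and variance. The function names, the dictionary layout and the toy reward sequences are assumptions made for this example; they are not part of CBTS.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Monte Carlo return of one rollout: sum over t of gamma^t * r_t."""
    g = 0.0
    # Accumulate backwards so each step applies one discount factor.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def return_statistics(rollouts_per_action, gamma):
    """Per-action mean and variance of Monte Carlo returns.

    rollouts_per_action maps an action label to a list of reward
    sequences (hypothetical layout chosen for this sketch).
    """
    stats = {}
    for action, rollouts in rollouts_per_action.items():
        returns = np.array([discounted_return(r, gamma) for r in rollouts])
        stats[action] = (float(returns.mean()), float(returns.var()))
    return stats

# Toy data: both actions have the same mean return but different spread.
toy_rollouts = {
    "a": [[1.0, 1.0], [1.0, 1.0]],
    "b": [[0.0, 0.0], [2.0, 2.0]],
}
toy_stats = return_statistics(toy_rollouts, gamma=0.5)
```

In this toy case the two actions share a mean return while their variances differ, which is the situation a single homoscedastic noise term cannot represent.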

Lastly, CBTS defines POMDP rewards as a function of the agent’s belief, which could be better modelled under the ρPOMDP framework [1]. In cases where some elements of the POMDP are unknown (e.g. transition dynamics), formulating the problem as a ρPOMDP would ensure an optimal solution can be found.
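A belief-dependent reward can be illustrated with the negative entropy of a discrete belief, which rewards the agent for being certain about the state. This is a minimal sketch: whether CBTS rewards take this particular form is an assumption of the example, and `negative_entropy_reward` is a hypothetical helper.

```python
import numpy as np

def negative_entropy_reward(belief):
    """Belief-dependent reward: negative Shannon entropy of a discrete belief.

    Returns 0 for a fully certain belief and -log(n) for a uniform
    belief over n states.
    """
    b = np.asarray(belief, dtype=float)
    b = b / b.sum()          # normalise in case weights do not sum to one
    nonzero = b[b > 0]       # treat 0 * log(0) as 0
    return float(np.sum(nonzero * np.log(nonzero)))

certain = negative_entropy_reward([1.0, 0.0, 0.0, 0.0])
uniform = negative_entropy_reward([0.25, 0.25, 0.25, 0.25])
```

The reward depends only on the belief and not on any single underlying state, which is the defining property of rewards in a ρPOMDP.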

Integrating observations into CBTS

The CBTS algorithm is based on PO-UCT and does not integrate observations in the tree search. This limits the applicability of the method to a subclass of domains, as state estimation errors compound in deeper tree nodes and may decrease planning performance. This problem could be addressed by extending CBTS to other types of MCTS algorithms which take observations into account.
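To make concrete what taking observations into account involves, the sketch below shows a standard discrete Bayes filter update, the kind of step an observation-aware tree search would apply when descending to a child node. It assumes a small discrete POMDP with known transition and observation matrices; the array layout and the name `belief_update` are choices made for this example and do not describe the CBTS implementation.

```python
import numpy as np

def belief_update(belief, T, O, action, observation):
    """One Bayes filter step for a discrete POMDP.

    belief: shape (S,), current distribution over states.
    T: shape (A, S, S), T[a, s, s2] = P(s2 | s, a).
    O: shape (A, S, Z), O[a, s2, z] = P(z | s2, a).
    """
    predicted = belief @ T[action]                        # prediction step
    unnormalised = predicted * O[action][:, observation]  # correction step
    total = unnormalised.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalised / total

# Toy problem: two states, one action that leaves the state unchanged,
# and a noisy sensor that reports the state index.
T = np.array([np.eye(2)])
O = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
updated = belief_update(np.array([0.5, 0.5]), T, O, action=0, observation=0)
```

Without the correction step the belief in this toy problem would stay uniform, which is the compounding estimation error described above.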

Extending the RL framework and EMU-Q to POMDPs

The proposed reinforcement learning framework for explicit exploration-exploitation balance and its implementation EMU-Q are based on Markov decision processes. MDPs have a built-in assumption that environment states can be exactly and fully obtained at every time step. This assumption is not realistic, and is lifted by the POMDP framework used in Chapters 3 and 4. Adapting the reinforcement learning work of Chapter 5 to POMDPs would be a very interesting avenue for future work and would greatly improve its applicability to robotics problems.

References

[1] Mauricio Araya, Olivier Buffet, Vincent Thomas, and François Charpillet. A POMDP extension with belief-dependent rewards. In Advances in Neural Information Processing Systems (NeurIPS), 2010.

[2] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.

[3] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. arXiv preprint arXiv:1802.04412, 2018.

[4] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[5] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning (ICML), 2017.

[6] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming: an overview. In IEEE Conference on Decision and Control, 1995.

[7] Jonathan Binney and Gaurav S Sukhatme. Branch and bound for informative path planning. In IEEE International Conference on Robotics and Automation (ICRA), 2012.

[8] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, Berlin, Heidelberg, 2006.

[9] Jean Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics, 1985.

[10] Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research (JMLR), 3, 2002.

[11] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

[12] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schul-

[13] Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst. Reinforcement learning and dynamic programming using function approximators. CRC press, 2017.

[14] Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2005.

[15] Adrien Couetoux, Mario Milone, Matyas Brendel, Hassan Doghmen, Michele Sebag, and Olivier Teytaud. Continuous rapid action value estimates. In Asian Conference on Machine Learning (ACML), 2011.

[16] Dennis D Cox and Susan John. A statistical method for global optimization. In IEEE International Conference on Systems, Man and Cybernetics, 1992.

[17] Dennis D Cox and Susan John. SDO: A statistical method for global optimization. Multidisciplinary design optimization: state of the art, 1997.

[18] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In Association for the Advancement of Artificial Intelligence (AAAI), 1998.

[19] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research (JAIR), 13, 2000.

[20] Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In International Conference on Machine Learning (ICML), 2005.

[21] Lior Fox, Leshem Choshen, and Yonatan Loewenstein. DORA the explorer: Directed outreaching reinforcement action-selection. In International Conference on Learning Representations (ICLR), 2018.

[22] Kenji Fukumizu, Le Song, and Arthur Gretton. Kernel Bayes’ rule. In Advances in Neural Information Processing Systems (NeurIPS), 2011.

[23] I Gihman and A Skorohod. The theory of stochastic processes, vol. I, 1974.

[24] Ruijie He, Emma Brunskill, and Nicholas Roy. PUMA: Planning under uncertainty with macro-actions. In Association for the Advancement of Artificial Intelligence (AAAI), 2010.

[25] Geoffrey A Hollinger, Brendan Englot, Franz Hover, Urbashi Mitra, and Gaurav S Sukhatme. Uncertainty-driven view planning for underwater inspection. In IEEE International Conference on Robotics and Automation (ICRA), 2012.

[26] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[27] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research (JMLR), 11, 2010.

[28] Donald R Jones, Cary D Perttunen, and Bruce E Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79, 1993.

[29] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13, 1998.

[30] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research (JAIR), 4, 1996.

[31] Motonobu Kanagawa, Yu Nishiyama, Arthur Gretton, and Kenji Fukumizu. Filtering with state-observation examples via kernel Monte Carlo filter. Neural Computation, 2016.

[32] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 2002.

[33] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), 2000.

[34] George Konidaris, Sarah Osentoski, and Philip S Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Association for the Advancement of Artificial Intelligence (AAAI), 2011.

[35] Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.

[36] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems (RSS), 2008.

[37] Hanna Kurniawati and Vinay Yadav. An online POMDP solver for uncertainty planning in dynamic environment. Robotics Research, 2016.

[38] Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering, 1964.

[39] Yoshiaki Kuwata, Gaston A Fiore, Justin Teo, Emilio Frazzoli, and Jonathan P How. Motion planning for urban driving using RRT. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2008.

[40] Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research (JMLR), 4, 2003.

[41] Steven M. Lavalle. Rapidly-exploring random trees: A new tool for path planning. 1998.

[42] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Aníbal R Figueiras-Vidal, et al. Sparse spectrum Gaussian process regression.

[43] Daniel Ying-Jeh Little and Friedrich Tobias Sommer. Learning and exploration in action-perception loops. Frontiers in neural circuits, 7, 2013.

[44] Manuel Lopes, Tobias Lang, Marc Toussaint, and Pierre-Yves Oudeyer. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems (NeurIPS), 2012.

[45] Wenjie Lu, Guoxian Zhang, and Silvia Ferrari. An information potential approach to integrated sensor path planning and control. IEEE Transactions on Robotics, (4), 2014.

[46] Shie Mannor, Reuven Y Rubinstein, and Yohai Gat. The cross entropy method for fast policy search. In International Conference on Machine Learning (ICML), 2003.

[47] Román Marchant and Fabio Ramos. Bayesian optimisation for intelligent environmental monitoring. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

[48] Román Marchant and Fabio Ramos. Bayesian optimisation for informative continuous path planning. In IEEE International Conference on Robotics and Automation (ICRA), 2014.

[49] Román Marchant, Fabio Ramos, and Scott Sanner. Sequential Bayesian optimisation for spatial-temporal monitoring. In Conference on Uncertainty in Artificial Intelligence (UAI), 2014.

[50] Zita Marinho, Anca Dragan, Arun Byravan, Byron Boots, Siddhartha Srinivasa, and Geoffrey Gordon. Functional gradient motion planning in reproducing kernel Hilbert spaces. arXiv preprint arXiv:1601.03648, 2016.

[51] Ruben Martinez-Cantin, Nando de Freitas, Arnaud Doucet, and José A Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and Systems (RSS), 2007.

[52] Nicolas Meuleau and Paul Bourgine. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35, 1999.

[53] Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, 1993.

[54] Philippe Morere, Román Marchant, and Fabio Ramos. Sequential Bayesian optimisation as a POMDP for environment monitoring with UAVs. In International Conference on Robotics and Automation (ICRA), 2017.

[55] Philippe Morere, Román Marchant, and Fabio Ramos. Continuous state-action-observation POMDPs for trajectory planning with Bayesian optimisation. In International Conference on Intelligent Robots and Systems (IROS), 2018.

[56] Philippe Morere and Fabio Ramos. Bayesian RL for goal-only rewards. In

[57] Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In International Conference on Machine Learning (ICML), 2010.

[58] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), 1999.

[59] Andrew Y Ng and Michael Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Conference on Uncertainty in Artificial Intelligence (UAI), 2000.

[60] Ali Nouri and Michael L Littman. Multi-resolution exploration in continuous spaces. In Advances in Neural Information Processing Systems (NeurIPS), 2009.

[61] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[62] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning (ICML), 2016.

[63] Pierre-Yves Oudeyer and Frederic Kaplan. How can we define intrinsic motivation? In International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, 2008.

[64] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), 2017.

[65] Richard P Paul. Robot manipulators: mathematics, programming, and control: the computer control of robot manipulators. Richard Paul, 1981.

[66] Joelle Pineau, Geoff Gordon, Sebastian Thrun, et al. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.

[67] R. Platt, R. Tedrake, L. Kaelbling, and T. Lozano-Perez. Belief space planning assuming maximum likelihood observations. In Robotics: Science and Systems (RSS), 2010.

[68] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research (JMLR), 6, 2005.

[69] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NeurIPS), 2008.

[70] Aravind Rajeswaran, Kendall Lowrey, Emanuel V. Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[71] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning. Springer, 2004.

[72] Chris Reinke, Eiji Uchibe, and Kenji Doya. Average reward optimization with multiple discounting reinforcement learners. In International Conference on Neural Information Processing (ICONIP), 2017.

[73] Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research (JAIR), 2013.

[74] Stéphane Ross, Joelle Pineau, Sébastien Paquet, and Brahim Chaib-Draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research (JAIR), 2008.

[75] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering Cambridge, England, 1994.

[76] Alireza Sahraei, Mohammad Taghi Manzuri, Mohammad Reza Razvan, Masoud Tajfard, and Saman Khoshbakht. Real-time trajectory generation for mobile robots. In Congress of the Italian Association for Artificial Intelligence, 2007.

[77] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.

[78] Konstantin M Seiler, Hanna Kurniawati, and Surya PN Singh. An online and approximate solver for POMDPs with continuous action space. In IEEE International Conference on Robotics and Automation (ICRA), 2015.

[79] David Silver and Joel Veness. Monte Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems (NeurIPS), 2010.

[80] Richard D Smallwood and Edward J Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21, 1973.

[81] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. In Advances in Neural Information Processing Systems (NeurIPS), 2013.

[82] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. Signal Processing Magazine, 2013.

[83] Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

[84] Jan Storck, Sepp Hochreiter, and Jürgen Schmidhuber. Reinforcement driven information acquisition in non-deterministic environments. In International Conference on Artificial Neural Networks (ICANN), 1995.

[85] Dougal J Sutherland and Jeff Schneider. On the error of random fourier features. In Conference on Uncertainty in Artificial Intelligence (UAI), 2015.

[86] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2, 1991.

[87] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998.

[88] István Szita and András Lőrincz. The many faces of optimism: a unifying approach. In International Conference on Machine Learning (ICML), 2008.

[89] Jur Van Den Berg, Sachin Patil, and Ron Alterovitz. Efficient approximate value iteration for continuous Gaussian POMDPs. In Association for the Advancement of Artificial Intelligence (AAAI), 2012.

[90] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8, 1992.

[91] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 1992.

[92] Aaron Wilson, Alan Fern, and Prasad Tadepalli. Using trajectory data to improve Bayesian optimization for reinforcement learning. The Journal of Machine Learning Research (JMLR), 15, 2014.

[93] Jonas Witt and Matthew Dunbabin. Go with the flow: Optimal AUV path planning in coastal environments. In Australian Conference on Robotics and Automation (ACRA), 2008.

[94] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning (ICML), 2014.

[95] Timothy Yee, Viliam Lisý, and Michael H Bowling. Monte Carlo tree search in continuous action spaces with execution uncertainty. In International Joint
