MDPs with convex cost functions

Chapter 4: Open Problems

4.2 MDPs with convex cost functions

We presented an exploration-exploitation algorithm that minimizes regret, measured against the best base-stock policy, in the periodic inventory control problem with censored demand, lost sales, and positive lead time. Using convexity properties of the long-run average cost function and a newly proven bound on the bias of base-stock policies, we extended a stochastic convex bandit algorithm to obtain a simple algorithm that substantially improves upon existing solutions for this problem. In particular, the regret bound for our algorithm maintains an optimal dependence on T while also achieving a linear dependence on the lead time. We believe this bound is the best an algorithm can achieve in this setting. However, there is as yet no rigorous lower bound on the regret for terms other than T. A complete lower-bound proof is left for future work.
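To make the setting concrete, the following is a minimal simulation sketch of a single base-stock (order-up-to) policy under lost sales and a fixed lead time. It is not the learning algorithm itself. The function name `simulate_base_stock`, the cost parameters, the order of events within a period, and the exponential demand in the demo are illustrative assumptions, not taken from the text.

```python
import random
from collections import deque


def simulate_base_stock(level, lead_time, demands,
                        holding_cost=1.0, lost_sales_penalty=4.0):
    """Run an order-up-to-`level` policy; return (average cost, observed sales).

    Assumed order of events in each period: place an order, receive the
    order placed `lead_time` periods ago, then face demand.
    """
    on_hand = 0.0
    # Outstanding orders, oldest first; always `lead_time` entries long.
    pipeline = deque([0.0] * lead_time)
    total_cost = 0.0
    observed_sales = []
    for demand in demands:
        # Raise the inventory position (on hand + in transit) to `level`.
        order = max(0.0, level - on_hand - sum(pipeline))
        pipeline.append(order)
        on_hand += pipeline.popleft()
        # Censored demand: the learner would see sales, not the demand itself.
        sales = min(on_hand, demand)
        observed_sales.append(sales)
        leftover = on_hand - sales
        lost = demand - sales  # unmet demand is lost, not backlogged
        # True cost; the lost-sales part is not observable under censoring.
        total_cost += holding_cost * leftover + lost_sales_penalty * lost
        on_hand = leftover
    return total_cost / len(demands), observed_sales


if __name__ == "__main__":
    rng = random.Random(0)
    demands = [rng.expovariate(1.0) for _ in range(5000)]
    for level in (1, 2, 3, 4, 5):
        cost, _ = simulate_base_stock(level, lead_time=2, demands=demands)
        print(level, round(cost, 3))
```

Sweeping `level` over a grid, as in the demo, gives an empirical view of the average cost as a function of the base-stock level, which is the quantity a convex bandit method would search over.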

Another open problem is the case where the lead time is not a deterministic constant L but stochastic, e.g., drawn from some known distribution. Although non-constant lead times are very difficult to handle in general, they arise in many practical applications, and handling them would be a significant advance in inventory management. Our algorithm and analysis do not extend directly to this case, but some modifications might handle certain instances of stochastic lead time, for example when the lead time is bounded by a known constant L. Finally, one may consider other algorithmic techniques for this problem, such as online gradient descent type algorithms. In our work, we believe a stochastic bandit type algorithm was the better choice because it does not change the policy often. Further research can study whether other algorithms can tackle some of these extensions.

We also presented an example in stochastic queueing where an exploration-exploitation regret minimization algorithm can be used. We considered a simple case in which the benchmark policies are restricted to fixed-server policies, i.e., policies that keep the number of servers fixed throughout the time horizon. A natural extension is to design learning algorithms for more dynamic benchmark policies that can adapt the number of servers to the queue length. We believe that our algorithmic framework of leveraging the convexity properties of the problem setting can be useful in designing such an algorithm with favorable regret bounds.
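As a small numerical illustration of the kind of convexity a fixed-server benchmark can exploit, the sketch below computes the steady-state expected queue length of an M/M/c queue with the standard Erlang C formula and checks that its second differences in the number of servers are non-negative. The M/M/c model, the cost weights, and all function names are assumptions made for illustration. The text does not specify this model, and the check is a sanity test on one instance, not a proof.

```python
def expected_queue_length(arrival_rate, service_rate, servers):
    """Steady-state mean number waiting in an M/M/c queue (Erlang C)."""
    offered_load = arrival_rate / service_rate
    utilization = offered_load / servers
    if utilization >= 1.0:
        raise ValueError("queue is unstable: need arrival_rate < servers * service_rate")
    # Erlang B blocking probability via the standard stable recursion.
    blocking = 1.0
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    # Erlang C: probability that an arriving customer has to wait.
    delay_prob = blocking / (1.0 - utilization * (1.0 - blocking))
    return delay_prob * utilization / (1.0 - utilization)


def average_cost(arrival_rate, service_rate, servers,
                 server_cost=1.0, waiting_cost=5.0):
    """Hypothetical per-period cost: linear staffing cost plus waiting cost."""
    queue = expected_queue_length(arrival_rate, service_rate, servers)
    return server_cost * servers + waiting_cost * queue


def second_differences(values):
    """Discrete second differences; all non-negative means discretely convex."""
    return [values[i - 1] - 2.0 * values[i] + values[i + 1]
            for i in range(1, len(values) - 1)]


if __name__ == "__main__":
    costs = [average_cost(8.0, 1.0, c) for c in range(9, 21)]
    print(all(d >= -1e-12 for d in second_differences(costs)))
```

Because the staffing term is linear in the number of servers, convexity of the total cost on this instance comes entirely from the queue-length term.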

Appendix A: Useful Concentration Inequalities

In this section we list some known results (or easy corollaries of known results) that are utilized in our proofs.
