Because $U'_j$ converges to $U^*$ (Lemma 18) and $U_j$ converges to $U'_j$ (Lemma 19), it follows that $U_j$ also converges to $U^*$. Consequently, $Q$ converges to $Q^*$, which proves Theorem 12.
Appendix F. Lemma 20
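Lemma 20 Consider a stochastic process $(\alpha_t, \Delta_t, F_t)$, $t \ge 0$, where $\alpha_t, \Delta_t, F_t : X \to \mathbb{R}$ satisfy the equations
$$\Delta_{t+1}(x) = (1 - \alpha_t(x))\,\Delta_t(x) + \alpha_t(x)\,F_t(x),$$
and assume that the following conditions hold:
1. The set $X$ is finite.
2. $\alpha_t(x) \in [0,1]$, $\sum_t \alpha_t(x) = \infty$.
3. $\|F_t\| \le \kappa \|\Delta_t\| + c_t$, where $\kappa \in [0,1)$ and $c_t$ converges to zero w.p.1,
where $\|\cdot\|$ denotes the maximum norm. Then $\Delta_t$ converges to zero with probability one.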
Note that this lemma is similar to Lemma 17, but the conditions on the learning rates are less strict (the condition $\sum_t (\alpha_t(x_t))^2 < \infty$ is not required), while the condition on $F_t$ is stricter (condition 3 bounds the value of $F_t$ itself rather than its expected value).
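The statement can be illustrated numerically. The sketch below is only an illustration under assumed settings, not part of the proof: the function name, the constant learning rate $\alpha_t(x) = 0.5$ (which satisfies $\sum_t \alpha_t(x) = \infty$ but not square-summability), the choice $c_t = 1/(t+1)$, and the way $F_t$ is sampled are all hypothetical choices that merely satisfy conditions 1 to 3.

```python
import random


def simulate_lemma20(num_states=5, steps=5000, kappa=0.9, alpha=0.5, seed=0):
    """Iterate Delta_{t+1}(x) = (1 - alpha) * Delta_t(x) + alpha * F_t(x).

    F_t is sampled so that ||F_t|| <= kappa * ||Delta_t|| + c_t holds in the
    maximum norm, with the assumed sequence c_t = 1 / (t + 1).
    Returns the maximum norm of Delta after `steps` updates.
    """
    rng = random.Random(seed)
    # Arbitrary (hypothetical) starting point Delta_0 on a finite set X.
    delta = [rng.uniform(-10.0, 10.0) for _ in range(num_states)]
    for t in range(steps):
        norm = max(abs(d) for d in delta)       # ||Delta_t||
        bound = kappa * norm + 1.0 / (t + 1)    # kappa * ||Delta_t|| + c_t
        # Each F_t(x) lies in [-bound, bound], so condition 3 is satisfied.
        f = [rng.uniform(-1.0, 1.0) * bound for _ in range(num_states)]
        delta = [(1.0 - alpha) * d + alpha * fx for d, fx in zip(delta, f)]
    return max(abs(d) for d in delta)


final_norm = simulate_lemma20()
print(final_norm)
```

With these settings the norm contracts by a factor of at most $0.5 + 0.5\kappa = 0.95$ per step up to the vanishing term $0.5\,c_t$, so the final norm is close to zero even though the learning rate does not decay.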
Proof The outline of the proof is as follows: we define a related process $\Delta'_t$ that converges to 0 and show that $\|\Delta_t\| \le \|\Delta'_t\|$ for all $t$. We ignore $c_t$ in this proof. This can be done safely, since $c_t$ converges to zero, $\kappa < 1$ and $\sum_t \alpha_t(x) = \infty$ for all $x$; the term is therefore asymptotically unimportant.
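We define $\Delta'_0(x) = \|\Delta_0\|$ for all $x$. For $t \ge 0$, $\Delta'_{t+1}(x)$ is defined as
$$\Delta'_{t+1}(x) = (1-\beta_t(x))\,\Delta'_t(x) + \beta_t(x)\,\kappa\|\Delta'_t\|, \tag{44}$$
with $\beta_t(x) \le \alpha_t(x)$ and $\beta_t(x) \in [0,1]$, $\sum_t \beta_t(x) = \infty$, $\sum_t (\beta_t(x))^2 < \infty$ w.p.1. It follows from (44) that $\|\Delta'_{t+1}\| \le \|\Delta'_t\|$. It also follows that if $\Delta'_t(x) \ge \kappa\|\Delta'_t\|$, then $\Delta'_{t+1}(x) \ge \kappa\|\Delta'_t\| \ge \kappa\|\Delta'_{t+1}\|$. Since $\Delta'_0(x) \ge \kappa\|\Delta'_0\|$, it follows that
$$\Delta'_t(x) \ge \kappa\|\Delta'_t\| \quad \text{for all } t. \tag{45}$$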
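The two properties derived from (44), namely that $\|\Delta'_t\|$ is non-increasing and that (45) holds, can be checked on a simulated instance. The following sketch rests on assumptions: the function names, the starting value 7.5, and the choice of $\beta_t(x)$ as a random value in $[0.5, 1]$ divided by $t+1$ are hypothetical and only chosen so that $\beta_t(x) \in [0,1]$, $\sum_t \beta_t(x) = \infty$ and $\sum_t (\beta_t(x))^2 < \infty$.

```python
import random


def max_norm(values):
    """Maximum norm of a finite vector."""
    return max(abs(v) for v in values)


def simulate_majorant(num_states=4, steps=2000, kappa=0.8, seed=1, tol=1e-9):
    """Iterate recursion (44) and check its two consequences at every step.

    Returns (norm_non_increasing, lower_bound_holds, final_norm).
    """
    rng = random.Random(seed)
    d_prime = [7.5] * num_states  # Delta'_0(x) = ||Delta_0|| (assumed value)
    norm_non_increasing = True
    lower_bound_holds = True
    for t in range(steps):
        norm = max_norm(d_prime)
        # Assumed step sizes: divergent sum, summable squares, values in [0, 1].
        beta = [rng.uniform(0.5, 1.0) / (t + 1) for _ in range(num_states)]
        new = [(1.0 - b) * d + b * kappa * norm for b, d in zip(beta, d_prime)]
        new_norm = max_norm(new)
        # ||Delta'_{t+1}|| <= ||Delta'_t||, up to floating-point tolerance.
        if new_norm > norm + tol:
            norm_non_increasing = False
        # Inequality (45): Delta'_{t+1}(x) >= kappa * ||Delta'_{t+1}||.
        if any(v < kappa * new_norm - tol for v in new):
            lower_bound_holds = False
        d_prime = new
    return norm_non_increasing, lower_bound_holds, max_norm(d_prime)


monotone, bounded_below, final_norm = simulate_majorant()
print(monotone, bounded_below, final_norm)
```

Because the step sizes decay like $1/t$, the norm of $\Delta'_t$ shrinks only slowly in such a run; the sketch checks the inequalities and does not demonstrate the limit itself.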
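Using Lemma 17, it can easily be shown that $\Delta'_t$ converges to 0 w.p.1. We now prove that $\|\Delta_t\| \le \|\Delta'_t\|$ for all $t$. We start by proving
$$|\Delta_t(x)| \le \Delta'_t(x) \text{ for all } x \;\Rightarrow\; |\Delta_{t+1}(x)| \le \Delta'_{t+1}(x) \text{ for all } x. \tag{46}$$
Assuming the left-hand side of (46), the following holds for $|\Delta_{t+1}(x)|$:
$$\begin{aligned} |\Delta_{t+1}(x)| &\le (1-\alpha_t(x))\,|\Delta_t(x)| + \alpha_t(x)\,\kappa\|\Delta_t\| \\ &\le (1-\alpha_t(x))\,\Delta'_t(x) + \alpha_t(x)\,\kappa\|\Delta'_t\|. \end{aligned}$$
By (45) and $\beta_t(x) \le \alpha_t(x)$, we can continue as
$$\begin{aligned} |\Delta_{t+1}(x)| &\le (1-\beta_t(x))\,\Delta'_t(x) + \beta_t(x)\,\kappa\|\Delta'_t\| \\ &\le \Delta'_{t+1}(x). \end{aligned}$$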
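This proves (46). Since $|\Delta_0(x)| \le \Delta'_0(x)$, it follows by induction that $|\Delta_t(x)| \le \Delta'_t(x)$ holds for all $t$, and hence $\|\Delta_t\| \le \|\Delta'_t\|$ for all $t$. Because $\Delta'_t$ converges to 0 w.p.1, so does $\Delta_t$.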
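The domination argument can also be checked numerically by running both processes side by side. This is a minimal sketch under assumed settings rather than a restatement of the proof: the function name, the sampling of $\alpha_t(x)$ from $[0.2, 1]$, the choice $\beta_t(x) = \alpha_t(x)/(t+1)$, and the sampling of $F_t$ are hypothetical, and $c_t$ is dropped as in the proof.

```python
import random


def check_domination(num_states=4, steps=2000, kappa=0.8, seed=2, tol=1e-9):
    """Run Delta_t and the majorant Delta'_t of (44) together.

    Returns True if |Delta_t(x)| <= Delta'_t(x) held for every x and t.
    """
    rng = random.Random(seed)
    delta = [rng.uniform(-5.0, 5.0) for _ in range(num_states)]
    norm0 = max(abs(d) for d in delta)
    d_prime = [norm0] * num_states  # Delta'_0(x) = ||Delta_0||
    for t in range(steps):
        norm = max(abs(d) for d in delta)  # ||Delta_t||
        norm_p = max(d_prime)              # ||Delta'_t|| (entries are positive)
        # Assumed learning rates with beta_t(x) <= alpha_t(x).
        alpha = [rng.uniform(0.2, 1.0) for _ in range(num_states)]
        beta = [a / (t + 1) for a in alpha]
        # F_t(x) in [-kappa*||Delta_t||, kappa*||Delta_t||]; c_t is ignored.
        f = [rng.uniform(-1.0, 1.0) * kappa * norm for _ in range(num_states)]
        delta = [(1.0 - a) * d + a * fx for a, d, fx in zip(alpha, delta, f)]
        d_prime = [(1.0 - b) * dp + b * kappa * norm_p
                   for b, dp in zip(beta, d_prime)]
        # Right-hand side of implication (46), up to floating-point tolerance.
        if any(abs(d) > dp + tol for d, dp in zip(delta, d_prime)):
            return False
    return True


dominated = check_domination()
print(dominated)
```

A run that returns `True` is consistent with (46) for the sampled trajectory only; it does not replace the induction.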