Convergence-Based Exploration Algorithm for Reinforcement Learning

(1)

Electrical and Computer Engineering Technical

Reports and White Papers Electrical and Computer Engineering

2018

Convergence-Based Exploration Algorithm for

Reinforcement Learning

Ala'Eddin Masadeh

Iowa State University, [email protected]

Zhengdao Wang

Ahmed E. Kamal

Follow this and additional works at:https://lib.dr.iastate.edu/ece_reports

Part of theSystems and Communications Commons

This Report is brought to you for free and open access by the Electrical and Computer Engineering at Iowa State University Digital Repository. It has been accepted for inclusion in Electrical and Computer Engineering Technical Reports and White Papers by an authorized administrator of Iowa State University Digital Repository. For more information, please [email protected].

Recommended Citation

Masadeh, Ala'Eddin; Wang, Zhengdao; and Kamal, Ahmed E., "Convergence-Based Exploration Algorithm for Reinforcement Learning" (2018).Electrical and Computer Engineering Technical Reports and White Papers. 1.

(2)

Convergence-Based Exploration Algorithm for Reinforcement Learning

Abstract

Reinforcement learning (RL) can be defined as a technique for learning in an unknown environment. Through learning, two main modes select actions, exploration and exploitation. The exploration is to investigate unexplored actions. The exploitation is to exploit current best actions. Balancing between exploration and exploitation is a challenge for RL. In this work, an exploration algorithm for RL is designed. This algorithm introduces two parameters for balancing purpose, which are the action-value function convergence error, and the exploration time threshold. The first parameter evaluates actions and selects the best ones based on the convergent values of their action-value functions. The exploration time threshold forces the agent to exploit the current best policy in the case of inability to explore available actions after a time. We show that this algorithm outperforms the well-known algorithm, which is the epsilon-greedy algorithm. We then study the effects of the introduced parameters on the performance.

Keywords

Reinforcement learning, Markov decision process, State-action-state-reward-action, Exploration, Exploitation

Disciplines

(3)

Convergence-Based Exploration Algorithm for

Reinforcement Learning

Ala’eddin Masadeh, Zhengdao Wang, Ahmed E. Kamal

Iowa State University (ISU), Ames, Iowa, United States,

{

amasadeh,zhengdao,kamal

}

@iastate.edu

Abstract

Reinforcement learning (RL) can be defined as a technique for learning in an unknown environment. Through learning, two main modes select actions, exploration and exploitation. The exploration is to investigate unexplored actions. The exploitation is to exploit current best actions. Balancing between exploration and exploitation is a challenge for RL. In this work, an exploration algorithm for RL is designed. This algorithm introduces two parameters for balancing purpose, which are the action-value function convergence error, and the exploration time threshold. The first parameter evaluates actions and selects the best ones based on the convergent values of their action-value functions. The exploration time threshold forces the agent to exploit the current best policy in the case of inability to explore available actions after a time. We show that this algorithm outperforms the well-known algorithm, which is the epsilon-greedy algorithm. We then study the effects of the introduced parameters on the performance.

Index terms—Reinforcement learning, Markov decision process, State-action-state-reward-action, Exploration, Exploitation

I. INTRODUCTION

Reinforcement learning (RL) enables an autonomous agent to optimize its policy to maximize its total expected reward. Here, the agent is placed in an unknown environment, and learns by trial and error [1].

A deterministic policy π can be defined as a function specifying the action π(s) that is taken by the agent when it is in state s. At any time, there is at least one action for each state that has highest estimated value, this action is called a greedy action. If the greedy action is selected, this mode is called exploitation, where the agent exploits its current knowledge to determine the current best action (i.e. greedy action), and then use it. On the other hand, when a nongreedy

(4)

action is selected, this mode is called exploration. Exploration enables the agent to improve its estimates about the nongreedy actions, and discover which of them has higher value than the greedy action [2].

Exploitation should be used when the available time is limited, where exploring new actions may degrade the performance. If this short time is used for exploration, it may result in exploring unfavorable actions, and waste the opportunity of exploiting current greedy action. Meanwhile, when the available time is abundant and there is ability to explore all available actions, then exploration may result in discovering better actions, and exploit them in the remaining time. Balancing between exploration and exploitation is a crucial factor that affects the cumulative rewards. This introduces the importance of balancing between exploration and exploitation, which is one of the main challenges that face RL, and known as the exploration-exploitation dilemma [2], [3].

Boltzmann and-greedy exploration algorithms are considered as the most popular exploration algorithms [4], where they are intensively used in the literature [3], [5]–[10]. These methods depend on random action selection to learn about new actions [4]. In -greedy, the agent selects a new action from uniformly distributed actions with probability , while the greedy action is selected with probability 1− [7]. On the other hand, Boltzmann or softmax exploration uses Boltzmann distribution to assign selection probability to actions [3].

Developing new and efficient exploration algorithms is an active area of research [11], where the goal is to enhance RL algorithms, and make them applicable for real applications. Due to these reasons, this topic has been widely investigated [11]–[13]. In [11], a counting-based exploration algorithm is designed. The count for state-action visitation count(s, a) is added to the Boltzmann equation to control the exploration. Each time an actionaat statesis selected, the count(s, a) increases by one. The main idea of adding this parameter is for balancing between exploration and exploitation.

In [12], the authors proposed the idea of adapting existing exploration algorithms, and com-bining them with each other. The framework deals with a set of discrete actions, where each of them is parameterized with continuous parameters. The action exploration is controlled by the Boltzmann equation with an inverse temperature parameter β. In parallel, the continuous action parameters are selected from a Gaussian distribution. β and the standard deviation σ of the Gaussian distribution are tuned based on the received reward. The dynamicity of β and σ helps this algorithm to be adaptive with non-stationary environment.

(5)

In this paper, we propose a new exploration algorithm, which is called the convergence-based exploration algorithm. This algorithm introduces two parameters for the goal of balancing between exploration and exploitation to maximize the cumulative rewards. The introduced pa-rameters are the action-value function convergence error ζ, and the exploration time threshold τ. The first parameter, ζ, measures convergence in the action-value function for a state-action pairs after a number of iterations. The action with highest value at each state will be exploited to maximize the cumulative rewards. The second parameter, τ, is used to control the exploration process if it takes long time. If the agent fails to explore all available actions beforeτ, the agent is forced to exploit the current best policy in the remaining time. One of the main disadvantages of some traditional methods, for example -greedy, is that the action with unfavorable value can be selected in the future. Meanwhile, in the proposed algorithm, once the action-value function for an action-state pair converges to unfavorable value, this action will be ignored, and the action with the highest value will be exploited. We show that the proposed algorithm outperforms the -greedy algorithm in our numerical examples.

II. REINFORCEMENT LEARNING

This section discuses the RL framework, which will be used in the following sections.

A. Markov decision processes

In general, Markov decision process (MDP) is used to describe RL problems [14]. The mathematical model of an MDP can be defined by the following principles:

1. A set of discrete states S. The state at time slot i is given by si, where si ∈ S ,

{1,2, ..., Ns}, and Ns is the number of all possible states.

2. A set of actions for all states is given by A _, {a1, a2, ..., aNa_}_{, where} _N

a is the number of all available actions. Given that si =s, the executed action at time i is given by ai, where ai ∈ As⊂ A.

3. A transition probability model p(s, a, s0), which is the transition probability from state s to state s0, given that action a is taken at state s0.

4. A reward function R(s, a, s0), which is the immediate reward obtained by the agent when transiting from state s to state s0 given that action a is taken at state s.

5. A discount factorγ ∈[0,1], which is used to determine the weight of the immediate reward compared to rewards in the future. This factor is selected to be less than one to guarantee that the future return is finite if the immediate reward is bounded [15].

(6)

B. State-action-reward-state-action (SARSA)

In this work, SARSA learning is used as a prediction method to estimate the action-value function. SARSA is an on-policy strategy that evaluates the policy used to make decisions [2]. The updating rule in SARSA can be expressed as [9]

q_πi+1(s, a)←qi_π(s, a) +α[R(s, a, s0)+ (1) qi_π+1(s0, a0)−q_πi(s, a)]

where q_πi+1 and q_πi are the action-value functions at time i and i+ 1, respectively. s and s0 are the current state and next state, respectively. a and a0 are the taken actions at the current state and the next state, respectively. 0< α <1 is the learning rate, which determines the amount of contribution of the newly acquired information to update the Q-table [16].

III. THEPROPOSEDALGORITHM: CONVERGENCE-BASEDEXPLORATION

A. Exploration time threshold τ

This factor,τ, plays a crucial role on the cumulative rewards. It can be defined as the maximum time available for an agent to explore its available actions. τ controls the exploration process for the goal of balancing between exploration and exploitation.

Fig. 1 illustrates the time system model. This model is designed such that the agent can explore its available actions with maximum time τ, and then, it is forced to exploit the best current policy in the remaining time T −τ regardless of exploring all actions. In this cotext, it is assumed that the total available time is T.

τ

0 _T

Fig. 1: Time system model.

Selecting a small value for τ slows down exploiting the greedy policy, which may results in decreasing the cumulative rewards if exploration takes long time and the available time is limited. Otherwise, it is preferred to assign large values for τ.

B. The action-value function convergence error ζ

Along withτ, ζ has an important role in improving the agent’s performance.ζ can be defined as the minimum required convergence error in the action-value function for a state-action pair before exploring a new action to that state.

(7)

The main idea of usingζ is to get nearly accurate values of the action-value function in early stage of the learning. This helps in evaluating different policies, and determine the best one based on the convergent values. This also enables the agent to exploit the resulted best policy in an early stage, where the discount factor has large values and can affect on the cumulative rewards.

C. The algorithm

In this algorithm, all the states are mapped initially to random actions. For a state, the assigned action is exploited till its action-value function converges to a value with an error ζ. Once the convergence occurs, a new action is assigned from the uniformly distributed unexplored actions to that state. This mechanism continues for all states, and stops in two cases.

The first case occurs if all actions for all states are evaluated before reaching τ. Once all available actions are valuated, the action with the highest action-value function will be exploited during the remaining time. The second case happens if the available time reachesτ. In this case, the agent stops exploring, and starts exploiting the current best policy regardless of whether all available actions have been explored or not.

One of the main characteristics of the proposed algorithm is that once an action for a state is evaluated, and its action-value function has been converged to unfavorable value, this action will be ignored, and will not be exploited in the future. This aims at maximizing the cumulative rewards based on the convergent values of the action function for different state-action pairs. One more property, this algorithm assigns dynamic learning time for different actions based on the convergence speed of their action-value functions. Algorithm 1 summarizes the proposed algorithm.

IV. EXPERIMENTAL RESULTS

In this section, three main tasks are presented: (1) the validity of the proposed algorithm; (2) the effect of the τ; and (3) the effect of the ζ.

The experiments are implemented as follows. Each state s is a composite state consisting of the environment state x and the agent state z. s is defined as follows:

s = (x, z) ∀s∈ S (2)

wherex is random value and fulfill the Markov property. Letz0 be the agent’s next state, which is a function of the current agent’s state z, the current taken action a, and x. z0 is set as

(8)

Algorithm 1 Convergence-Based Algorithm

1: Initialize q0_{(s, a)}_,_∀_s_{∈ S}_,_∀_a_{∈ A}s_{, arbitrarily} 2: Initialize qbest_{(s), a}best_(s)_∀_s_∈_S

3: Assign initial actions to policy π

4: foreach i∈T do

5: Observesi=s

6: Select actionato statesaccording to the policyπ a←π(s)

As_{← A}s₋_a

7: ObserveR(s, a, s0),si+1=s0

8: Select action a0 _{to state}_s0 _{according to the policy} π a0 ←π(s0) As0 _{← A}s0₋_a0 9: qi+1_{(s, a)} _← ₍₁ ₋_α)qi_{(s, a) +} _{α[R(s, a, s}0_{) +} γqi_(s0_{, a}0_)] 10: if |qi+1_{(s, a)}₋_qi_{(s, a)}_{| ≤}_ζ _AND_{i > τ} _then 11: if qi+1(s, a)≥qbest_(s)_then 12: qbest_(s)_←_qi+1_{(s, a)} 13: abest(s)←a 14: end if 15: if As₆₌_φ _then

16: Update πby selecting a new random action fromAs _{to state} _s 17: else 18: a←abest_(s) 19: end if 20: else ifi≤τ then 21: a←abest_(s) 22: end if 23: end for where a≤z.

The immediate reward resulted from transition from one state to another, given an action is taken, are represented by random values. The transition probability from statesto state s0, given

(9)

that action a is selected, is given by p(s, a, s0) =      p(x0|x), if (3) is satisfied 0, else (4)

where the influence of the action a comes from satisfying (3).

A. Experimental Set-up

This experiment considers a problem with 12 states. x∈ {0,1,2,3}, and z∈ {0,1,2}. In this experiment, the transition probability matrix for x is given by

P =        0.72 0.08 0.18 0.02 0.08 0.72 0.02 0.18 0.18 0.02 0.72 0.08 0.02 0.18 0.08 0.72       

where each element represents the transition probability from x to x0 (i.e. p(x0|x)).

The rewards take values of {0,0.5,1,25,50} that are randomly assigned to different actions at different states. The discount factor γ is set to 0.9. Each state from 1 to 4 has only one action, which is 0. The set of actions available to states from 5 to 8 is {0,1}. For the states from 9 to 12, the set of available actions is {0,1,2}. SARSA learning algorithm is used to evaluate actions for the proposed and the -greedy algorithms, where the agent does not have a prior knowledge of the dynamic of environment. The learning rate α is set to 0.1. Let T be the total available time. All results are averaged over 1000 runs.

B. Comparison

In this experiment, the performance of the proposed algorithm is evaluated and compared with two other approaches. The first approach is the adaptive -greedy exploration algorithm, where the exploration probability = 1/i, andi is the time slot number [9]. The second one is the optimal performance, where the optimal policy is used from the first time slot. The optimal performance needs a priori statistical knowledge of the environment, which is not available to the proposed algorithm and the -greedy algorithm. In this work, value iteration (VI) [17] is used to find the optimal policy to find the upper-bound. In this experiment, for the proposed algorithm, ζ is set to 0.01, and the τ is set to 0.95T.

Fig. 2 shows the cumulative rewards versus T. As expected, the cumulative rewards of all approaches increase as T increases in the beginning, and then they take a near-constant

(10)

0 200 400 600 800 1000 1200 1400 1600 1800 2000 T 0 100 200 300 400 500 600 700 Cumulative rewards The upper-bound Convergence-based Algorithm Adaptive Epsilon Algorithm

Fig. 2: Cumulative reward versus T for different approaches

performance. This returns to the discount factor, where this factor decreases as the time increases, which diminishes the effect of the future rewards on the cumulative rewards. As shown in this figure, the proposed algorithm outperforms the adaptive -greedy. This is since the proposed algorithm starts by exploring most of the available actions based on their convergent results in early stage of learning. This enables the agent to exploit the best resulted policy early, where the discount factor has large values and can affect on the cumulative rewards significantly. On the other hand, the -greedy does not have an accurate measure that enables it from evaluating available policies to exploit the best one in early stage of the session, where the discount factor has large values and affect significantly on the cumulative rewards.

C. The effect of the τ

This part studies the effect of the τ on the cumulative rewards. For the proposed algorithm, ζ is set to 0.05.

Fig. 3 shows the cumulative rewards versus T. The cumulative rewards for different values of this threshold increase asT increases in the beginning of the session. Then, they take a near-constant performance due to the discount factor. As shown, the cumulative rewards decreases as the value of τ decreases. As mentioned previously, once reaching this threshold, the agent is forced to stop exploring more actions, and exploits the current best policy. So, as the value of this threshold decreases, the opportunity of getting the optimal or near optimal policy decreases, which contributes in reducing the cumulative rewards that can be achieved.

(11)

0 200 400 600 800 1000 1200 1400 1600 1800 2000 T 0 100 200 300 400 500 600 Cumulative rewards τ/T=99% τ/T=90% τ/T=50% τ/T=10%

Fig. 3: Cumulative reward versus T for different values of the exploration time

percentage

D. The effect of the ζ

This part illustrates the effect of theζ on the cumulative rewards. For the proposed algorithm, the τ is set to 0.7T. 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 T 300 350 400 450 500 550 600 Cumulative rewards ζ =0.5 ζ =0.1 ζ =0.01

Fig. 4: Cumulative reward versus T for different values of the convergence error

Fig. 4 shows the influence of T on the cumulative rewards at different values of the ζ. As shown, for all value of the convergence error, increasing T increases the utility function up to a point, and then they take a near-constant behavior due the discount factor. This figure also shows that the best performance is achieved when the convergence error has a value of 0.5, and as the

(12)

convergence error decreases, the performance decreases. This returns to the fact that decreasing the convergence error requires more iterations, which slows down the process of exploring and exploiting the best resulted policy, which is reflected on the cumulative rewards by reduction.

V. CONCLUSION

In this paper, an exploration algorithm for RL, which is known by convergence-based ex-ploration, has been introduced. This algorithm is able to enhance the exploration efficiency for an agent significantly. Using two factors, which are the action-value function convergence error ε, and the exploitation time threshold Tthr, the agent can balance between exploration and

exploitation to maximize the cumulative rewards. Simulation results show how the proposed algorithm outperforms the adaptive -greedy algorithm. Finally, the effects of the introduced parameters were investigated, and it was noticed how the values of these parameters can be optimized to maximize the cumulative rewards.

REFERENCES

[1] T. Mannucci, E.-J. van Kampen, C. de Visser, and Q. Chu, “Safe exploration algorithms for reinforcement learning

controllers,”IEEE Transactions on Neural Networks and Learning Systems, vol. PP, 2017.

[2] R. Sutton and A. Barto,Reinforcement learning: An introduction. Cambridge, MA, MIT Press, 1998.

[3] A. D. Tijsma, M. M. Drugan, and M. A. Wiering, “Comparing exploration strategies for q-learning in random stochastic

mazes,” in Proc. of the IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, Dec. 2016, pp.

1–8.

[4] J. van Ast and R. Babuska, “Dynamic exploration in q (λ)-learning,” inProc. of the IEEE International Joint Conference

on Neural Network Proceedings, Vancouver, BC, Canada, July 2006, pp. 41–46.

[5] S. Mahadevan, “Average reward reinforcement learning: Foundations, algorithms, and empirical results,”Machine learning,

vol. 22, no. 1, pp. 159–195, 1996.

[6] M. Emre, G. G¨ur, S. Bayhan, and F. Alag¨oz, “Cooperativeq: Energy-efficient channel access based on cooperative

reinforcement learning,” in Proc. of the IEEE International Conference on Communication Workshop (ICCW), London,

UK, June 2015, pp. 2799–2805.

[7] Z. Xia and D. Zhao, “Online reinforcement learning by bayesian inference,” inProc. of the International Joint Conference

on Neural Networks (IJCNN), Killarney, Ireland, July 2015, pp. 1–6.

[8] Z. Wang, Z. Shi, Y. Li, and J. Tu, “The optimization of path planning for multi-robot system using boltzmann policy based

q-learning algorithm,” in Proc. of the IEEE International Conference on Robotics and Biomimetics (ROBIO), Shenzhen,

China, Dec. 2013, pp. 1199–1204.

[9] A. Ortiz, H. Al-Shatri, X. Li, T. Weber, and A. Klein, “Reinforcement learning for energy harvesting point-to-point

communications,” in Proc. of the IEEE International Conference on Communications (ICC), Kuala Lumpur, Malaysia,

May 2016, pp. 1–6.

[10] C. Szepesv´ari, “Algorithms for reinforcement learning,”Synthesis lectures on artificial intelligence and machine learning,

(13)

[11] Z.-X. Xu, X.-L. Chen, L. Cao, and C.-X. Li, “A study of count-based exploration and bonus for reinforcement learning,” in Proc. of the IEEE 2nd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), Chengdu, China, Apr. 2017, pp. 425–429.

[12] M. Khamassi, G. Velentzas, T. Tsitsimis, and C. Tzafestas, “Active exploration and parameterized reinforcement learning

applied to a simulated human-robot interaction task,” in Proc. of the First IEEE International Conference on Robotic

Computing (IRC), Taichung, Taiwan, Apr. 2017, pp. 28–35.

[13] T.-H. Teng and A.-H. Tan, “Knowledge-based exploration for reinforcement learning in self-organizing neural networks,” in Proc. of the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Macau, China, Dec. 2012, pp. 332–339.

[14] M. Pieters and M. A. Wiering, “Q-learning with experience replay in a dynamic environment,” in Proc. of the IEEE

Symposium Series on Computational Intelligence (SSCI), Athens, Greece. IEEE, Dec. 2016, pp. 1–8.

[15] V. Heidrich-Meisner, M. Lauer, C. Igel, and M. A. Riedmiller, “Reinforcement learning in a nutshell.” inESANN, 2007,

pp. 277–288.

[16] Y. Xu, W. Zhang, W. Liu, and F. Ferrese, “Multiagent-based reinforcement learning for optimal reactive power dispatch,”

IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1742–1751, Dec. 2012.

[17] T. Wang, C. Jiang, and Y. Ren, “Access points selection in super wifi network powered by solar energy harvesting,” in