
CHAPTER 6 LEARNING BASED TRAFFIC CONTROL

6.4 Elements of the RL-based algorithm

6.4.3 Action selection strategy

The agent's action is to switch on any of the available phases in the signal-timing plan. Note that there is no restriction on the sequence of the phases. Flexible phase sequencing has been used by previous researchers and has been implemented at real-world signalized intersections. The algorithm obeys the minimum and maximum green constraints; currently, the thresholds for these parameters are assumed. Reinforcement learning algorithms in general require a balance between exploitation and exploration when selecting actions. The simplest action rule is to select the action (or one of the actions) with the highest estimated state-action value (completely greedy behavior). In other words, the agent always tries to maximize the immediate reward using its current knowledge, without any attempt to explore other possible actions. To balance exploitation and exploration, Sutton and Barto [175] suggest two methods:

6.4.4 ε-greedy method

In this method, the agent behaves greedily in most cases, choosing the action with the maximum state-action value, but occasionally chooses a random action instead. The probability of this random behavior is ε, and the probability of selecting the optimal action converges to greater than 1 − ε. Note that the relative advantage of the ε-greedy method compared with other methods is highly dependent on the type of problem.
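The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function name `epsilon_greedy`, the list-of-values representation of Q(s, ·), and the random tie-breaking are all assumptions made for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from the state-action values of one state.

    q_values: list of Q(s, a) estimates, one per available phase (assumed layout).
    epsilon:  probability of taking a uniformly random action.
    """
    if rng.random() < epsilon:
        # Explore: every action is equally likely, including the worst one.
        return rng.randrange(len(q_values))
    # Exploit: choose among the highest-valued actions, breaking ties at random.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])
```

With n actions and a unique greedy action, this rule selects the greedy action with probability 1 − ε + ε/n, which is why the probability of choosing it is greater than 1 − ε.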

6.4.5 Softmax method

One limitation of the ε-greedy method is that it gives equal priority to all actions while exploring, so it may choose the worst action instead of the next-best action. To resolve this, softmax algorithms vary the action probabilities as a graded function of estimated value. Although the greedy action still has the highest selection probability, the others are ranked and weighted according to their value estimates. In general, a Gibbs or Boltzmann distribution is used to define the probability. The probability of choosing action a in state s is

$$P(a \mid \text{state} = s) = \frac{\exp\left(Q(s,a)/\tau\right)}{\sum_{b=1}^{\text{all actions}} \exp\left(Q(s,b)/\tau\right)} \tag{6.7}$$

where τ is a positive parameter called the temperature. Higher values of the temperature make the probabilities of choosing the actions nearly equal. Lower values of the temperature create a larger difference in the action selection probabilities. Another commonly used action strategy combines the two strategies above and is referred to as ε-softmax. The agent behaves greedily with probability (1 − ε), and in the remaining cases it selects an action using the probabilities computed from the softmax selection process.
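A small Python sketch may help make the role of the temperature concrete. It computes the Boltzmann probabilities of Eq. (6.7) and then uses them for ε-softmax selection. The function names, the list layout of the Q-values, and the max-subtraction trick for numerical stability are assumptions of this example, not details taken from the thesis.

```python
import math
import random


def softmax_probabilities(q_values, tau):
    """Boltzmann action probabilities of Eq. (6.7) for a single state."""
    # Subtracting the maximum leaves the ratios unchanged and avoids overflow.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def epsilon_softmax(q_values, epsilon, tau, rng=random):
    """Greedy with probability 1 - epsilon, otherwise a softmax draw."""
    actions = range(len(q_values))
    if rng.random() < 1.0 - epsilon:
        # Greedy branch: the first action with the highest estimated value.
        return max(actions, key=lambda a: q_values[a])
    # Exploration branch: sample according to the Boltzmann probabilities.
    probs = softmax_probabilities(q_values, tau)
    return rng.choices(actions, weights=probs, k=1)[0]
```

Calling `softmax_probabilities` on the same Q-values with a large τ and then a small τ shows the behavior described above: the probabilities flatten toward uniform in the first case and concentrate on the greedy action in the second.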

6.4.6 Reward function

Three separate reward functions have been used: queue length (R1), average delay experienced at the intersection since the previous action (R2), and residual queue (R3). In addition, we propose a multi-reward structure that uses queue length as the reward at free flow, average delay over the time interval as the reward at medium congestion, and residual queue as the reward at near-saturated conditions.

6.4.7 Multi-reward structure

The multi-reward structure dynamically changes the reward function type based on the traffic congestion in real time. We consider three categories of congestion states: (a) free flow to low congestion, (b) low to medium congestion, and (c) medium to high congestion (saturated condition). The algorithm identifies the congestion state in real time and uses the corresponding reward function. This research defines queue length as the reward at free flow (to reduce the number of stops), average delay over the time interval as the reward at medium congestion, and residual queue as the reward at near-saturated conditions (to avoid gridlock and spillback).
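The switching logic can be illustrated with a short Python sketch. Everything specific here is hypothetical: the text does not say how the congestion state is measured or where the category boundaries lie, so the scalar `congestion` input, the `low` and `high` thresholds, and the choice to negate each measure (so that a larger queue or delay yields a smaller reward) are assumptions made only for illustration.

```python
def multi_reward(congestion, queue_length, average_delay, residual_queue,
                 low=0.3, high=0.8):
    """Return (reward_name, reward) for the current congestion state.

    congestion: hypothetical scalar congestion measure (for example a
                degree of saturation); the thresholds are placeholders.
    """
    if congestion < low:
        # (a) Free flow to low congestion: reward based on queue length (R1).
        return "R1_queue_length", -queue_length
    if congestion < high:
        # (b) Low to medium congestion: reward based on average delay (R2).
        return "R2_average_delay", -average_delay
    # (c) Medium to high congestion: reward based on residual queue (R3).
    return "R3_residual_queue", -residual_queue
```

Keeping the selection in one function means the learning update itself does not need to know which reward type is active at a given step.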

6.5 Algorithm description

We applied three specific temporal-difference techniques: (a) off-policy TD control (Q-learning), (b) on-policy TD control (SARSA), and (c) advanced off-policy TD. Like most RL-based schemes, the proposed algorithm has two phases: a learning phase and an implementation phase. Learning takes place before implementation. During the learning phase, the agents update the state-action values by interacting with the environment. Balancing exploration and exploitation is important in this phase. Initially, the algorithm uses a higher probability of exploration. This value is then gradually decreased, and at the end of the learning phase we implement the softmax method. During the implementation period, the algorithm emphasizes exploitation, with a very small ε value.
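The idea of starting with a high exploration probability and reducing it gradually could be realized in many ways; the text does not state the schedule that was used. As one hedged possibility, the sketch below uses a linear decay. The function name and the default start and end values are hypothetical.

```python
def exploration_rate(k, n_iterations, eps_start=0.9, eps_end=0.05):
    """Linearly decayed exploration probability at learning iteration k.

    The linear form and the default values are illustrative assumptions;
    n_iterations plays the role of the maximum number of learning iterations.
    """
    # Clamp progress to [0, 1] so the rate stays at eps_end after learning ends.
    fraction = min(max(k / n_iterations, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```

The returned value can be passed as the exploration probability to whichever action selection rule is in use during the learning phase.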

6.5.1 Notations

ρ = Average reward per time step.

Q(s, a) = Value of the state-action pair (s, a).

r(s, a, s′) = Observed reward when the agent takes action a in state s and moves to state s′.

α(k) = Learning rate for the Q-values (scalar) at the k-th iteration.

β(k) = Learning rate for the average reward at step k.

N = Maximum number of iterations allowed (learning phase).

γ = Discount factor for the reward value.

6.5.2 RMART description

RMART does not divide the experience into separate episodes with finite returns. Instead, the value functions are defined with respect to the average expected reward per time step under the policy κ, which is defined as:

$$\rho^{\kappa} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E_{\kappa}(r_t) \tag{6.8}$$

RMART uses the long-term average reward instead of the discounted reward used in Q-learning and SARSA. Tsitsiklis and Van Roy [177] provide an analytical comparison between the discounted (Q-learning) and average reward techniques and show that, as the discount factor approaches 1, the value function from the discounted technique approaches the differential value function from the average reward technique. Average reward methods also offer computational advantages (Tsitsiklis and Van Roy [177]).
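To show what an average-reward update can look like in code, the sketch below follows the R-learning rule as presented by Sutton and Barto. It is offered only as a familiar example of the average-reward family and should not be read as the RMART update itself, which is not spelled out in this section. The nested-list layout of `Q` and the function name are assumptions.

```python
def r_learning_update(Q, rho, s, a, r, s_next, alpha, beta):
    """One R-learning style update; modifies Q in place and returns the new rho.

    Q:     nested list, Q[state][action] (assumed layout).
    rho:   current estimate of the average reward per time step.
    alpha: learning rate for the Q-values.
    beta:  learning rate for the average reward.
    """
    # The reward is measured relative to the average reward rho,
    # and no discount factor is applied to the next-state value.
    Q[s][a] += alpha * (r - rho + max(Q[s_next]) - Q[s][a])
    # The average reward is adjusted only when the action taken is greedy.
    if Q[s][a] == max(Q[s]):
        rho += beta * (r - rho + max(Q[s_next]) - max(Q[s]))
    return rho
```

The absence of γ in the update is the practical difference from the discounted methods: rewards are compared with the running average ρ rather than being discounted over time.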

6.5.3 Pseudo Code

Since Q-learning and SARSA have almost the same framework, we use a single algorithm and separate out the update phase. In the learning phase, the agent builds its state-action mapping table, which can be used later to make decisions (which phase to activate) in the implementation phase. Next, we present the pseudo code for Q-learning and SARSA, and for RMART.
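The point that the two methods share a framework and differ only in the update step can be seen in a compact Python sketch of the standard textbook update rules. The function name, the `method` switch, and the nested-list layout of `Q` are assumptions made for this illustration and are not taken from the pseudo code.

```python
def td_update(Q, s, a, r, s_next, a_next, alpha, gamma, method="q_learning"):
    """Apply one tabular TD update in place and return the new Q[s][a].

    Q:      nested list, Q[state][action] (assumed layout).
    a_next: action actually selected in s_next (used by SARSA only).
    """
    if method == "q_learning":
        # Off-policy: bootstrap from the best action in the next state.
        target = r + gamma * max(Q[s_next])
    elif method == "sarsa":
        # On-policy: bootstrap from the action the agent will actually take.
        target = r + gamma * Q[s_next][a_next]
    else:
        raise ValueError(f"unknown method: {method}")
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```

Everything outside this function (observing the state, selecting a phase, collecting the reward) can be identical for both methods, which is what allows a single algorithm with a separate update phase.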

In document Integrating Pro-Environmental Behavior with Transportation Network Modeling: User and System Level Strategies, Implementation, and Evaluation (Page 160-164)