
Formulation as a Reinforcement Learning Problem

Wen et al. [173] have shown that the RM control problem may be formulated as an MDP and, as a result, may be solved using RL algorithms. This section is devoted to a description of the formulation of the RM problem as an RL problem, which serves as the blueprint for the various highway control algorithmic implementations later in this dissertation. The modelling approach adopted here was inspired by the work of Davarynejad et al. [29] and Rezaee [130], and is applied to the benchmark model described in Chapter 5. RM is enforced by a single traffic signal placed at an on-ramp, as shown in Figure 6.1.

6.2.1 The State Space

The three principal components that make up the state space are described in this section. These components are illustrated graphically in Figure 6.2. The first state is the density ρds directly downstream of the on-ramp. This state has been selected as it provides the agent with direct feedback in respect of the quality of the previous action, because this is the bottleneck location and thus the source of congestion. As a result it is expected that the earliest indicator of impending congestion may be the downstream density.

Figure 6.1: The RM implementation adopted within the benchmark model of §5.1.2.

The second state is the density ρus upstream of the on-ramp. This state has been selected because it provides an indication as to how far the congestion, if any, has propagated backwards along the highway.

Finally, the third state is the on-ramp queue length w. This state is included so as to provide the agent with information on the prevailing traffic conditions along the on-ramp, as well as providing information about the on-ramp demand.


Figure 6.2: A representation of the state space for the RM problem in the context of the benchmark model of §5.1.2.
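The three state components described above may be collected in a small record type. The following is a minimal sketch only: the class name `RampMeteringState`, the field names, the units and the numerical values are illustrative assumptions and are not taken from the benchmark model.

```python
from typing import NamedTuple


class RampMeteringState(NamedTuple):
    """Hypothetical container for the three state components of the RM agent."""

    rho_ds: float  # density directly downstream of the on-ramp (assumed veh/km)
    rho_us: float  # density upstream of the on-ramp (assumed veh/km)
    w: float       # on-ramp queue length (assumed to be measured in vehicles)


# Illustrative values only; these are not measurements from the benchmark model.
state = RampMeteringState(rho_ds=26.0, rho_us=18.5, w=7)
```

A tabular RL agent would additionally require these continuous measurements to be discretised; no discretisation is assumed in this sketch.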

6.2.2 The Action Space

In order to improve the state of traffic flow, the learning agent may select a suitable action based on the prevailing traffic conditions. Rezaee [130] showed that the use of a direct action selection policy (i.e. selecting a red phase duration directly from a set of pre-specified red times) instead of an incremental action selection policy (i.e. adjusting the red phase duration incrementally) yields better results when applied to the RM problem. As a result, a direct action selection policy is adopted for the work presented in this dissertation.

As stated above, red phase times are varied in order to control the flow of vehicles that enter the highway from the on-ramp. Direct action selection then implies that the agent chooses pre-specified red phase durations from the set of actions A. In this case, the actions available to the agent are a ∈ {0, 2, 3, 4, 6, 8, 10, 13}, where each action represents a corresponding red phase duration measured in seconds. These red phase durations correspond to the respective on-ramp flows qOR ∈ {1 600, 720, 600, 514, 400, 327, 277, 225} vehicles per hour, assuming a green phase duration of three seconds in each case.
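The relationship between the red phase durations and the on-ramp flows may be illustrated numerically. The sketch below assumes that a single vehicle is released per signal cycle (red phase plus three-second green phase), which reproduces the listed flows for all non-zero red phase durations. The flow of 1 600 vehicles per hour for a red phase duration of zero is taken directly from the text rather than computed. All names in the code are illustrative.

```python
GREEN_PHASE_S = 3                          # green phase duration in seconds (from the text)
RED_PHASES_S = (0, 2, 3, 4, 6, 8, 10, 13)  # action set A: red phase durations in seconds
UNMETERED_FLOW_VEH_H = 1600                # flow stated in the text for a red phase of 0 s


def on_ramp_flow(red_s: int) -> int:
    """Return the on-ramp flow (veh/h) for a given red phase duration.

    Assumption: exactly one vehicle is released per signal cycle.
    """
    if red_s == 0:
        return UNMETERED_FLOW_VEH_H
    cycle_s = red_s + GREEN_PHASE_S
    return round(3600 / cycle_s)


FLOWS_VEH_H = [on_ramp_flow(red) for red in RED_PHASES_S]
```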

6.2.3 The Reward Function

Typically, the objective when designing a traffic control system is to minimise the combined total travel time spent in the system by all transportation network users. From the fundamental traffic flow diagram (see Figure 3.1) it follows that the maximum throughput, which corresponds to maximum flow, occurs at the critical density [115]. Density is usually the variable that the RM agent aims to control. This is the case in ALINEA, the most celebrated RM technique. As a result of the successful implementation of ALINEA in several studies and real-world applications [113], the reward function adopted in order to provide feedback to the RM agent has been inspired by the ALINEA control law. According to the ALINEA control law, given in (3.17), the metering rate is adjusted based on the difference between the measured density downstream of the on-ramp and a desired downstream density. The reward awarded to the RM agent is calculated as

$$r(t) = -(\hat{\rho} - \rho_{ds})^2, \qquad (6.2)$$

where ρ̂ denotes the desired density the RL agent aims to achieve directly downstream of the on-ramp, and ρds denotes the measured density downstream of the on-ramp during time interval t, as indicated in Figure 6.2. The difference between the desired and measured densities is squared in order to amplify the effect of large deviations from the desired density, thereby providing amplified negative feedback for actions which result in such large deviations. A portion of this reward function is shown in Figure 6.3.

Figure 6.3: The reward function employed for the RM agent in the context of the benchmark model of §5.1.2 with a desired density ρ̂ = 24.8 veh/km. Axes: ρds (horizontal, ticks from 15.0 to 35.0) and r (vertical, ticks from −100.0 to −20.0).
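The reward in (6.2) may be evaluated directly from a downstream density measurement. The following minimal sketch uses the desired density of 24.8 veh/km quoted in the caption of Figure 6.3. The function and constant names are illustrative.

```python
DESIRED_DENSITY = 24.8  # desired downstream density in veh/km (caption of Figure 6.3)


def reward(rho_ds: float, rho_hat: float = DESIRED_DENSITY) -> float:
    """Negative squared deviation of the measured from the desired density, as in (6.2)."""
    return -((rho_hat - rho_ds) ** 2)
```

The reward is zero when the measured density equals the desired density, and a deviation of 10 veh/km in either direction yields a reward of approximately −100.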

6.2.4 Learning Rate and Action Selection

Watkins and Dayan [169] have shown that Q-Learning suppresses uncertainties and converges to the optimal Q-values if a decreasing learning rate is employed, as long as the sum

$$\sum_{i=1}^{\infty} \alpha_{n_i(s,a)} \qquad (6.3)$$

diverges, while the corresponding sum of squared learning rates remains finite. The learning rate

$$\alpha_{n_i(s,a)} = \frac{1}{1 + i(1 - \gamma)}, \qquad (6.5)$$

which decreases as a function of the number of visits to state-action pairs, is employed in this dissertation, where i denotes the index of the i-th visit to the state-action pair (s, a), and γ denotes the discount factor, as defined in §2.2.2. The discount factor is set to γ = 0.94, which was found to be near-optimal for traffic applications by Rezaee [130].
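The visit-dependent learning rate in (6.5) may be combined with the standard tabular Q-Learning update as sketched below. This is an illustration under stated assumptions rather than the implementation used in this dissertation: the class `TabularQ`, its method names and the convention that the visit counter is incremented before the learning rate is evaluated (so that the first visit uses i = 1) are all assumptions.

```python
from collections import defaultdict

GAMMA = 0.94  # discount factor quoted in the text


def learning_rate(i: int, gamma: float = GAMMA) -> float:
    """Learning rate for the i-th visit to a state-action pair, as in (6.5)."""
    return 1.0 / (1.0 + i * (1.0 - gamma))


class TabularQ:
    """Hypothetical tabular Q-Learning agent with per-pair visit counts."""

    def __init__(self, actions, gamma: float = GAMMA):
        self.actions = tuple(actions)
        self.gamma = gamma
        self.q = defaultdict(float)     # Q-values, initialised to zero
        self.visits = defaultdict(int)  # visit counts per (state, action) pair

    def update(self, s, a, r, s_next) -> float:
        """Apply one Q-Learning update and return the learning rate used."""
        self.visits[(s, a)] += 1  # assumption: the first visit has index i = 1
        alpha = learning_rate(self.visits[(s, a)], self.gamma)
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        target = r + self.gamma * best_next
        self.q[(s, a)] += alpha * (target - self.q[(s, a)])
        return alpha
```

With γ = 0.94 the learning rate starts at approximately 0.943 on the first visit and decays slowly, so that early experience is weighted heavily while later updates are increasingly averaged.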

As stated in §2.2.2, a trade-off between exploration and exploitation of the state-action space is of primary importance when solving RL problems. In order to achieve a balance between exploration and exploitation, an adaptive ε-greedy policy is employed in this dissertation. As with the learning rate above, the adaptive ε-value is determined as a function of the number of visits to a state s. This state-dependent ε-value is given by

$$\epsilon(s) = \max\left\{0.05,\ \left[\frac{1}{1 + \frac{1}{15 N_a(s)} \sum_{i=1}^{a} i(s)}\right]\right\}, \qquad (6.6)$$

where Na(s) denotes the number of available actions a when the system is in state s and i(s) denotes the number of visits to state s. Employing such a state-dependent ε-value encourages exploration in the case where a state has not yet been visited, but encourages exploitation as the number of visits to the state increases, as the ε-value steadily decreases to a minimum value of 0.05. The methods of determining the adaptive learning rate αni(s,a) and the state-dependent ε-value are based on the work of Rezaee [130], and have been fine-tuned empirically so as to yield the most effective results.
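A state-dependent ε-greedy rule of this kind may be sketched as follows. The sketch rests on an assumed reading of (6.6): the sum in the denominator is taken to be the total number of visits recorded over all actions available in state s, scaled by 1/(15 Na(s)). The function names and the way in which visit counts are stored are illustrative only.

```python
import random


def epsilon(visits_per_action, floor: float = 0.05, scale: float = 15.0) -> float:
    """State-dependent exploration rate under the assumed reading of (6.6).

    visits_per_action holds one visit count per action available in the state.
    """
    n_actions = len(visits_per_action)
    total_visits = sum(visits_per_action)
    return max(floor, 1.0 / (1.0 + total_visits / (scale * n_actions)))


def select_action(q_values, visits_per_action, rng=random) -> int:
    """Return an action index: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon(visits_per_action):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Under this reading an unvisited state yields ε = 1 (pure exploration), an average of 15 visits per action yields ε = 0.5, and the floor of 0.05 is reached once the average number of visits per action reaches 285.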