
4.1 Environment Descriptions

4.1.3 Self-Driving Car

The self-driving car (SDC) environment is a control problem in which a car, controlled by the agent, must navigate an environment while avoiding collisions and maximising speed. The car has collision sensors positioned around it, each of which can detect whether an obstacle occupies its position, but not the distance to that obstacle. Additionally, the car can observe its current velocity. Using just the observations from these sensors, the agent attempts to learn to drive as fast as possible without crashing. The final behaviour learnt depends on the layout of the environment, which the agent cannot observe.

All observations made by the agent (car) are taken from its own frame of reference; this includes the obstacles (e.g., there is an obstacle on my left) and the car’s current speed. This implementation is conceptually similar to a blind person with a cane, tapping nearby surroundings to determine obstructions. The agent cannot observe its position in the environment. For example, the agent cannot determine if it is in the top-right section of the map, as it has no reference to its current position. Additionally, the agent does not attempt to model the environment to build a belief about the layout of the map.

Figure 4.5: A graphical representation of the Simulated Car agent. The small blue square in the top left is the car. The yellow line within the car/square indicates the current direction and the number below is the current velocity. The smaller green squares surrounding the car/square are collision sensors and will always align with the car’s current direction. The large white rectangles are obstacles. If the blue box representing the car collides with a wall or the obstacles, the episode terminates.

At each step, the environment gives the agent a reward equal to its current velocity. A penalty of -100 is awarded each time the agent collides with an obstacle. Following a collision, the agent’s position resets to a safe position within the map, its velocity resets to the lower limit, and its direction of travel is set to face the direction with the longest distance to an obstacle. These values are chosen to give the agent the safest possible start to its learning, conditions that could reasonably be expected in the case of a real self-driving car.
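The reward signal described above can be sketched as a small function. This is only an illustrative sketch: the names `step_reward` and `COLLISION_PENALTY` are hypothetical, and it assumes that the penalty replaces the velocity reward on a colliding step rather than being added to it, which the text does not state explicitly.

```python
COLLISION_PENALTY = -100.0  # penalty value given in the text


def step_reward(velocity: float, collided: bool) -> float:
    """Return the reward for a single environment step.

    Assumption: on a collision the agent receives only the penalty,
    not the penalty plus the velocity reward.
    """
    if collided:
        return COLLISION_PENALTY
    # Otherwise the reward equals the current velocity (in m/s).
    return velocity


# Example: four safe steps at increasing speed, then a collision.
episode = [(1.0, False), (1.5, False), (2.0, False), (2.5, False), (2.5, True)]
episode_return = sum(step_reward(v, c) for v, c in episode)
```

Under these assumptions a single collision outweighs many steps of driving at top speed, which is why the agent is pushed towards avoiding obstacles before it maximises velocity.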

Figure 4.5 shows the map used for the self-driving car experiments performed in this body of research. This map challenges the agent to learn a behaviour that maximises velocity while avoiding collisions by using a layout that prohibits turning at high speeds in the narrow corridors on the top, right, and bottom of the map. The only two sections of the map that allow for high-velocity turning are the large empty sections on the left side. It is important to remember that the agent cannot see the overall layout of the map, only whether there are possible obstacles nearby. As with any Reinforcement Learning agent, the aim is not to learn what the environment looks like, but how best to respond to its current observations and how to act to improve its future situation.

Figure 4.5 also provides a representation of the agent and the positioning of the collision sensors around it. These collision sensors return a boolean response indicating whether there is an obstacle at that position, though not the distance to that obstacle. Additionally, the agent does not know the position of its sensors in reference to itself. The only information the agent has regarding the sensors is whether each is currently colliding with an obstacle. The agent also knows its current velocity. The possible velocity of the agent is capped at 1 m/s at the lower end and 5 m/s at the higher end. Having the lower cap above zero velocity prevents the agent from moving in reverse or standing still. This lower limit reduces the state space and prevents an unintended solution, namely that standing still is an excellent method for avoiding collisions. The upper limit of 5 m/s is set so that velocity is not limitless and further reduces the state space, while still being high enough that it exceeds the limit for a safe turn anywhere in the environment. An action that attempts to exceed the velocity thresholds set by the environment results in the velocity being set to the respective limit.
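A minimal sketch of the velocity cap and of the observation the agent receives might look as follows. The function names and the tuple layout of the observation are assumptions made for illustration; only the 1 m/s and 5 m/s limits and the boolean nature of the sensors come from the text.

```python
from typing import Sequence, Tuple

V_MIN = 1.0  # lower velocity cap in m/s (from the text)
V_MAX = 5.0  # upper velocity cap in m/s (from the text)


def clamp_velocity(requested: float) -> float:
    """Return the requested velocity, limited to the range [V_MIN, V_MAX]."""
    return max(V_MIN, min(V_MAX, requested))


def make_observation(sensor_hits: Sequence[bool], velocity: float) -> Tuple:
    """Build the observation: one boolean per sensor, then the velocity.

    The agent sees no distances, no sensor positions and no map position.
    The tuple layout is a hypothetical choice for this sketch.
    """
    return tuple(bool(hit) for hit in sensor_hits) + (velocity,)


# Example: accelerating from the top speed leaves the velocity at the cap.
capped = clamp_velocity(5.0 + 0.5)
```

Because the lower cap is above zero, no sequence of actions can bring the car to a halt in this sketch, which mirrors the design decision described above.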

There are five possible actions for the agent to take within the self-driving car environment. The five actions are:

(i) Accelerate. The agent will increase its velocity by 0.5 meters per second. If the agent chooses to accelerate when it is already travelling at the top speed of 5.0 m/s, then no change will be made to the agent’s velocity. However, equivalent to the ‘Do Nothing’ action, this will still register as an action taken and the agent’s position is updated.

(ii) Decelerate. The opposite of accelerating. The agent’s velocity will decrease by 0.5 meters per second. If the agent chooses to decelerate when it is already travelling at the minimum speed of 1.0 m/s, then no change is made to the agent’s velocity, but the agent’s position will be updated, equivalent to the ‘Do Nothing’ action.

(iii) Turn Left. The agent will alter its direction of travel by 5 degrees to the left. The current velocity does not affect how much the agent can turn by, only how much distance is travelled while altering its direction. The agent’s direction of travel changes only in response to a turning action; once changed, it remains fixed until another turning action is taken.

(iv) Turn Right. This action operates with the same dynamics and constraints as turning left, but the agent will turn right instead.

(v) Do Nothing. Neither the agent’s velocity nor its direction of travel is altered. When performing this action the only change is the agent’s position, based on current velocity, position, and direction of travel. Actions that attempt to accelerate or decelerate the agent beyond the velocity bounds of the environment behave equivalently to this action.

Figure 4.6 shows the process the environment follows to perform the action chosen by the agent and return the new observation. The process executes in the following order.

(i) The environment receives the action selected by the agent.

(ii) The agent’s velocity or direction of travel updates according to the action selected. If the action requests that the agent’s velocity exceed the bounds set by the environment, then the velocity is set to the corresponding bound.

(iii) The agent’s position is updated based on the current velocity and direction of travel. If the agent collides with an obstacle, then a terminal state is returned. The terminal state has no observation and a substantial penalty reward. The process ends here if a terminal state is returned.

(iv) The state information is collected from the environment. In addition to the agent’s current velocity, each of the agent’s collision sensors is checked, and the result of each is added to the state information.

(v) The environment sends the current state to the agent. The agent uses this state information to make a decision on the action to perform and the process repeats.
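The five actions and the step process listed above can be sketched together as a single step function. This is not the original implementation: the `Car` class, the `step` signature, and the `collides` and `read_sensors` callables are hypothetical stand-ins for the map geometry. The sketch also assumes one time unit per step (so the distance moved equals the velocity), a heading measured counter-clockwise in degrees, and that a left turn increases the heading.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

V_MIN, V_MAX, DV = 1.0, 5.0, 0.5   # velocity bounds and increment (from the text)
TURN_DEG = 5.0                     # degrees per turning action (from the text)
COLLISION_PENALTY = -100.0         # collision penalty (from the text)
ACCELERATE, DECELERATE, TURN_LEFT, TURN_RIGHT, DO_NOTHING = range(5)


@dataclass
class Car:
    x: float
    y: float
    heading_deg: float
    velocity: float = V_MIN


def step(
    car: Car,
    action: int,
    collides: Callable[[Car], bool],
    read_sensors: Callable[[Car], Sequence[bool]],
) -> Tuple[Optional[tuple], float, bool]:
    """Apply one action and return (observation, reward, terminal)."""
    # (ii) Update velocity or heading; velocity is held at the bounds.
    if action == ACCELERATE:
        car.velocity = min(V_MAX, car.velocity + DV)
    elif action == DECELERATE:
        car.velocity = max(V_MIN, car.velocity - DV)
    elif action == TURN_LEFT:
        car.heading_deg = (car.heading_deg + TURN_DEG) % 360.0
    elif action == TURN_RIGHT:
        car.heading_deg = (car.heading_deg - TURN_DEG) % 360.0
    # DO_NOTHING changes neither velocity nor heading.

    # (iii) Move the car; every action, including Do Nothing, moves it.
    heading_rad = math.radians(car.heading_deg)
    car.x += car.velocity * math.cos(heading_rad)
    car.y += car.velocity * math.sin(heading_rad)
    if collides(car):
        # Terminal state: no observation, only the penalty.
        return None, COLLISION_PENALTY, True

    # (iv)-(v) Collect sensor readings plus velocity and return them.
    observation = tuple(read_sensors(car)) + (car.velocity,)
    return observation, car.velocity, False
```

For example, calling `step(Car(0.0, 0.0, 0.0, 5.0), ACCELERATE, lambda c: False, lambda c: (False,) * 7)` leaves the velocity at 5.0 m/s and still advances the car, matching the description of accelerating at top speed.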

Agent’s State Representation

The Self-Driving Car environment has eight state features: one for each of the seven collision sensors on the car, plus the current velocity of the car. The collision sensor state features are boolean, representing whether they detect an obstacle at their position. The velocity of the agent has nine possible values: the upper and lower limits, plus every 0.5 m/s increment in between. With the inclusion of the five possible actions, this environment has 5760 state-action pairs.
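The size of the state-action space can be checked with a few lines of arithmetic. The sketch below assumes seven collision sensors, which is the number implied by eight state features including velocity and is consistent with the stated total of 5760. The `state_index` helper is a hypothetical illustration of how such a state could index a tabular value function; it is not taken from the original work.

```python
from typing import Sequence

N_SENSORS = 7            # assumption: eight features minus the velocity feature
N_ACTIONS = 5            # from the text
V_MIN, DV = 1.0, 0.5     # lower velocity limit and increment (from the text)

# Nine velocity values: 1.0, 1.5, ..., 5.0
velocities = [V_MIN + DV * i for i in range(9)]

n_states = 2 ** N_SENSORS * len(velocities)   # 128 * 9
n_state_actions = n_states * N_ACTIONS        # 1152 * 5


def state_index(sensor_hits: Sequence[bool], velocity: float) -> int:
    """Map an observation to a unique integer in range(n_states)."""
    # Pack the boolean sensor readings into an integer, one bit per sensor.
    bits = sum(int(hit) << i for i, hit in enumerate(sensor_hits))
    # Convert the velocity to its position on the 0.5 m/s grid.
    v_idx = int(round((velocity - V_MIN) / DV))
    return v_idx * 2 ** N_SENSORS + bits


print(n_states, n_state_actions)  # prints: 1152 5760
```

A space of this size is small enough that a table with one entry per state-action pair remains practical.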

The reward function defined by the environment encourages the agent to learn a behaviour that avoids obstacles while attempting to achieve the highest velocity the environment allows. The most natural solution that achieves these conditions is to drive in a circle, assuming that the path of the circle does not intersect with an obstacle. The map chosen for use in these experiments allows an unobstructed circular path to be found, but only at low velocities. If the agent is to meet both conditions that achieve the highest reward, a more complex behaviour must be learnt (see Figure 4.7).

Figure 4.7: Optimal path for Simulated Car environment. Turning at the wide corners allows the agent to maintain a higher velocity.