8.5 Experiments and Methodology
8.5.3 Environments and Experiments
The mountain car environment is a common benchmark problem in Reinforcement Learning because of its small state and action spaces and its simple dynamics. Although the dynamics are simple, the agent must perform the optimal action consistently if it is to make it to the goal state. The mountain car environment is a good candidate for Rule-Based Interactive Reinforcement Learning because the optimal solution can be captured in very few rules while remaining understandable by humans. A detailed specification of the mountain car environment is provided in Section 4.1.2. The rule-based and state-based agents are tested against the mountain car environment, employing simulated users with varying levels of knowledge of the environment. The aim is to compare the performance of the agents and the number of interactions performed to achieve that performance. The mountain car agents are given a learning rate of 0.25 and a discount factor of 0.9, and use an ε-greedy action selection strategy with an epsilon of 0.05.
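The hyperparameters above can be read against a minimal tabular sketch such as the following. It assumes a one-step Q-learning update and three discrete actions; the text does not name the update rule, and the actual action set is the one specified in Section 4.1.2, so both are illustrative assumptions, as are all function names.

```python
import random
from collections import defaultdict

ALPHA = 0.25    # learning rate (from the text)
GAMMA = 0.9     # discount factor (from the text)
EPSILON = 0.05  # exploration rate (from the text)

# Assumption: three discrete actions (push left, no push, push right).
ACTIONS = (0, 1, 2)


def make_q_table():
    """Return a Q-table whose unseen (state, action) entries default to 0.0."""
    return defaultdict(float)


def select_action(q, state, rng=random, epsilon=EPSILON):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    # Ties are broken in favour of the first action in ACTIONS.
    return max(ACTIONS, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, done):
    """One-step Q-learning update (assumed; not confirmed by the text)."""
    best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
    target = reward + GAMMA * best_next
    q[(state, action)] += ALPHA * (target - q[(state, action)])
```

With an epsilon of 0.05 the agent takes a random action on roughly one step in twenty, which matters in this environment because a single poorly timed action can cost the momentum needed to reach the goal.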
The self-driving car environment has the agent take control of a car and navigate an environment. The goal of the agent is to learn a behaviour that maximises the car's velocity while avoiding collisions. The state and action spaces for this environment are larger than those of the mountain car environment, but still remain understandable by human observers. The self-driving car agents are given a learning rate of 0.1 and a discount factor of 0.999, and use an ε-greedy action selection strategy with an epsilon of 0.01.
The two requirements of the reward function, avoiding collisions and maximising velocity, make the creation of optimal rules much more difficult. For the self-driving car environment, it is easy to provide rules that improve performance in parts of the environment, for example rules for maximising speed or rules for when to turn. However, it is much more difficult to provide rules that meet both requirements optimally, for example, when to turn the car, and by how much, to maintain the highest possible velocity without crashing. The ease of creating rules that improve performance yet are not optimal is what makes the self-driving car environment an interesting benchmark for Rule-Based Interactive Reinforcement Learning. A detailed specification of the self-driving car environment is provided in Section 4.1.3. The rule-based and state-based agents are tested against the self-driving car environment, employing simulated users with varying levels of knowledge of the environment. The aim is to compare the performance of the agents and the number of interactions performed to achieve that performance. Unlike the mountain car environment, this environment tests a larger state and feature space, and involves advice that, while beneficial, is not optimal.
The final environment is the Super Mario game environment. The dynamics and reward function of the Mario environment are complex and are not intuitive to human observers. For example, it is not readily apparent to the advising user whether the optimal behaviour is to rush to the end of each level, or to attempt to maximise the score by collecting items and killing enemies. For this reason, the advising human may believe they are providing rules for optimal behaviour when in reality they are doing the opposite. Two implementations of the Mario environment are used in the following experiments. A summary of each implementation is provided below. For a detailed specification of the Mario environment, and of each of the state-feature implementations, refer to Section 4.1.4. The Mario agents are given a learning rate of 0.001 and a discount factor of 0.9, and use an ε-greedy action selection strategy with an epsilon of 0.05.
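For reference, the per-environment agent settings stated in this section can be collected in one place. The sketch below is only a convenience summary; the class, dictionary, and key names are hypothetical, and the `effective_horizon` helper uses the common 1 / (1 - γ) rule of thumb, which is not taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    """Hyperparameters for one family of experimental agents."""
    learning_rate: float
    discount: float
    epsilon: float  # exploration rate for epsilon-greedy selection


# Values are taken from the text; the key names are illustrative only.
AGENT_CONFIGS = {
    "mountain_car": AgentConfig(learning_rate=0.25, discount=0.9, epsilon=0.05),
    "self_driving_car": AgentConfig(learning_rate=0.1, discount=0.999, epsilon=0.01),
    "mario": AgentConfig(learning_rate=0.001, discount=0.9, epsilon=0.05),
}


def effective_horizon(discount: float) -> float:
    """Rule-of-thumb planning horizon, 1 / (1 - discount), in time steps."""
    return 1.0 / (1.0 - discount)


if __name__ == "__main__":
    for name, cfg in AGENT_CONFIGS.items():
        print(name, cfg, round(effective_horizon(cfg.discount)))
```

Under that rule of thumb, the self-driving car agents weigh rewards over a horizon of roughly a thousand steps, compared with roughly ten steps for the mountain car and Mario agents.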
The first Mario state-feature implementation, named the Littman implementation, has a very large state space that is not easily interpretable by the observing human (Goschin et al., 2013). The number of state features in the Littman implementation changes constantly. These state features list the exact position, velocity, and acceleration of every visible entity as continuous values, as well as a representation of each of the 352 visible tiles. Due to the detail and size of the Littman implementation, typical advice givers may have difficulty creating rules with it.
The second implementation of the Mario environment, named the Brys+ implementation, is much simpler and more understandable than the Littman implementation (Brys, 2016; Harutyunyan, Brys, et al., 2015). This simplification comes from heavy abstraction and the discarding of state features. Rather than showing the exact position, velocity, and acceleration of all entities on the screen, the Brys+ implementation only shows how many tiles away the nearest entity is. Furthermore, the Brys+ implementation is entirely discrete and has a static number of state features. This abstraction and reduction of state features allows humans to create rules for providing advice more easily, and also speeds up the agent's learning.
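The kind of abstraction described above can be illustrated with a small sketch that reduces continuous entity positions to a discrete tile offset to the nearest entity. This is not the Brys+ feature set itself, which is specified in Section 4.1.4; the `Entity` type, the `TILE_SIZE` value, and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

TILE_SIZE = 16.0  # assumption: width and height of one tile in position units


@dataclass(frozen=True)
class Entity:
    """Continuous description of an entity, in the spirit of the detailed format."""
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0


def nearest_entity_tiles(mario: Entity,
                         others: Sequence[Entity]) -> Optional[Tuple[int, int]]:
    """Return the (dx, dy) offset, in whole tiles, to the nearest entity.

    Velocities are discarded and offsets are truncated toward zero, so many
    distinct continuous states map onto the same discrete feature.
    """
    if not others:
        return None
    nearest = min(others,
                  key=lambda e: (e.x - mario.x) ** 2 + (e.y - mario.y) ** 2)
    dx = int((nearest.x - mario.x) / TILE_SIZE)
    dy = int((nearest.y - mario.y) / TILE_SIZE)
    return (dx, dy)
```

A rule such as "jump when the nearest entity is two tiles ahead" is straightforward to state against this discrete output, whereas the same rule over raw positions and velocities would need thresholds on several continuous values.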
It stands to reason that while an agent’s learning may be faster using the reduced state space of the Brys+ implementation, the agent should learn a better solution using the more detailed and finer-grained information provided by the Littman implementation. However, it may be more difficult for an advising user to create rules for the Littman implementation, so for the following experiments the user will only provide advice in the context of the Brys+ implementation. Because the information required to represent a state in Brys+ format can be sourced entirely from a Littman state, both the user and the agent can use the implementation best suited to them.
Figure 8.8: Definition of the persistent rules-based assisted experimental agent with feature set interpretation using the Assisted Reinforcement Learning framework.
For the following experiments, the agent will receive state information in the Littman format but will transform the state to Brys+ format before asking the user for assistance or checking its model for suitable recommendations. This way, the human does not need to know the representation the agent is using and can work in the more user-friendly format, while the agent continues to use the more detailed implementation, which should lead to better behaviour. Having the agent learn from one feature space while receiving advice in terms of another set of features is likely to be useful for environments with large feature sets, or with features that are difficult for users to comprehend, such as image- or pixel-based environments. Testing the use of two feature sets, one for the agent and one for the user, requires three experiments. The first compares the learning speed and end behaviour of each of the standalone unassisted implementations. The second tests the achievable performance and learning speed when the agent and user both use the Brys+ implementation. The final experiment tests the achievable performance and learning speed of the agent when it learns using the Littman implementation and is assisted using the Brys+ implementation.
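The two-feature-set arrangement can be sketched as follows. The agent keeps its detailed state for learning, and only the translated state is used to look up retained advice or to query the user. This is a simplified illustration, not the agent of Figure 8.8: retained advice is keyed by the translated state rather than stored as rules, and the class name, the `to_user_space` translation, and the `ask_user` callback are all hypothetical.

```python
from typing import Callable, Dict, Hashable, Optional

State = Hashable
Action = int


class TwoFeatureSetAdvisor:
    """Looks up or requests advice in the user's feature space."""

    def __init__(self,
                 to_user_space: Callable[[State], State],
                 ask_user: Callable[[State], Optional[Action]]):
        self.to_user_space = to_user_space  # e.g. detailed -> abstract format
        self.ask_user = ask_user            # returns None when no advice is given
        self.advice_model: Dict[State, Action] = {}
        self.interactions = 0               # number of times the user was asked

    def recommend(self, agent_state: State) -> Optional[Action]:
        """Return advice for the agent's state, or None if none is available."""
        user_state = self.to_user_space(agent_state)
        # Check retained advice first, so the user is not asked twice.
        if user_state in self.advice_model:
            return self.advice_model[user_state]
        self.interactions += 1
        advice = self.ask_user(user_state)
        if advice is not None:
            self.advice_model[user_state] = advice
        return advice
```

In use, the learning update would still be applied to the detailed `agent_state`, so any recommended action is evaluated and refined in the agent's own representation. The `interactions` counter corresponds to the number of interactions that the experiments compare alongside performance.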